The purpose of this script is to perform gene expression analysis following pseudotime trajectory analysis on single-cell RNA-seq data of T-cell populations within the normal canine thymus and lymph node, to track how gene expression changes as T cells progress through development.
This script was adapted from the Monocle 3 documentation (https://cole-trapnell-lab.github.io/monocle3/docs/trajectories/)
library(Seurat)
library(knitr)
library(tidyverse)
library(patchwork)
library(monocle3)
library(SeuratWrappers)
library(pheatmap)
library(stringr)
library(RColorBrewer)
Input: A Seurat object of T-cell subsets from normal canine lymph node and thymus.
setwd("C:/Users/edlarsen/Documents/240828_scRNAseq/Pseudotime")
seu <- readRDS(file = "C:/Users/edlarsen/Documents/240828_scRNAseq/T_Cells/IntegThymAndLN_Annotated.RData")
# convert seurat to cds
cds <- as.cell_data_set(seu, assay = "RNA")
cds <- estimate_size_factors(cds)
fData(cds)$gene_short_name <- rownames(fData(cds))
# preprocessing and trajectory analysis with Monocle 3
cds <- cluster_cells(cds, reduction_method = "UMAP")
colData(cds)$assigned_cell_type <- as.character(clusters(cds))
colData(cds)$assigned_cell_type <- dplyr::recode(colData(cds)$assigned_cell_type,
"1" = "DP_Thymocytes_2",
"2" = "DP_Thymocytes_1",
"3" = "Naive_to_Activated_T",
"4" = "Naive_T_2",
"5" = "SP_Thym_and_Naive_T",
"6" = "Naive_T_1",
"7" = "Activated_T",
"8" = "SP_Thym",
"9" = "Proliferating_DP_Thym",
"10" = "CD8_NK",
"11" = "ETP_DN_ISP_Thym",
"12" = "Late_SP_Thym")
cds <- reduce_dimension(cds)
cds <- cluster_cells(cds, reduction_method = "UMAP")
cds <- learn_graph(cds, use_partition = F)
cds <- order_cells(cds, reduction_method = "UMAP", root_cells = colnames(cds[, clusters(cds) == 10]))
cds$monocle3_pseudotime <- pseudotime(cds)
Plot the top gene for each cluster by pseudo R-squared value, a specificity metric ranging from 0 to 1.
marker_test_res <- top_markers(cds, group_cells_by = 'assigned_cell_type', reference_cells = 1000)
write.csv(marker_test_res, file = "topMarkers_Monocle3_clusters.csv")
top_specific_markers <- marker_test_res %>%
# expressed by at least 85% of cells in the cluster
filter(fraction_expressing >= 0.85) %>%
group_by(cell_group) %>%
# top 3 genes by pseudo R-squared value for specificity
top_n(3, pseudo_R2)
top_specific_markers_ids <- unique(top_specific_markers %>% pull(gene_id))
plot_genes_by_group(cds,
top_specific_markers_ids,
group_cells_by = "assigned_cell_type",
ordering_type = "cluster_row_col",
max.size = 5)
About these genes:
SKAP1 encodes a T-cell adapter protein that enables SH2 binding to promote TCR signaling.
FYN is an Src protein kinase that is essential for T-cell activation. (https://www.pnas.org/doi/full/10.1073/pnas.0406168101).
ENSCAFG00000025128 in CanFam3.1 is an un-annotated gene in the region of the J genes of the TCR.
PARP8 is one of a member of PARP enzymes involved in DNA repair.
CBLB promotes clearance of the TCR from the cell surface as a way to mitigate the T-cell response (https://pubmed.ncbi.nlm.nih.gov/12415267/).
INPP4B is an inhibitor of PI3K/AKT signaling.
BACH2 is a transcription factor that regulates IL-2 expression (https://pubmed.ncbi.nlm.nih.gov/18769450/) and restrains terminal differentiation to enable generation of memory cells (https://pubmed.ncbi.nlm.nih.gov/27158840/).
MT-CO2 is a mitochondrial gene encoding a subunit of cytochrome c oxidase.
LTB = lymphotoxin beta; expressed by activated T cells.
DLA-64 is an MHC class I gene.
GIMAP family proteins play a variety of roles in T-cell development and survival. GIMAP4 is expressed in developing T cells at the DN4 stage in response to pre*TCR signaling, is transiently downregulated in the DP stage, and re-expressed in SP thymocytes and peripheral T cells. (https://pubmed.ncbi.nlm.nih.gov/16569770/).
RPL13 is a ribosomal protein shown to play a role in antiviral immune responses (https://pmc.ncbi.nlm.nih.gov/articles/PMC8173215/).
CD79A is supposed to be specific for B cells.
RPS28 is a ribosomal protein that plays a role in MHC class I peptide generation.
B2M is a component of the MHC class I molecule.
S100A8 is a gene associated with cytotoxic immune responses. It has also been shown to have regulatory activity against T cells by inhibiting B7 expression to reduce antigen presentation and negatively regulating leukocyte adhesion and transmigration (https://pubmed.ncbi.nlm.nih.gov/29942307/)
ENSCAFG00000007461 is the CD8B gene.
ARPP21 is a thymocyte-specific RNA-binding protein that provides TCR repertoire diversity by binding to the 3’ UTR and promoting RAG1 mRNA expression (https://pubmed.ncbi.nlm.nih.gov/38467629/).
PDE4D is a phosphodiesterase that regulates TCR signaling by attenuating the negative constraint of cAMP (https://pubmed.ncbi.nlm.nih.gov/17404263/). One study from the 90s found that PDE4 is the dominant PDE species involved in determining the metabolism of cAMP in thymocytes (https://pubmed.ncbi.nlm.nih.gov/8730511/).
The role of PDE1C in thymocytes/T cells is unclear.
KNL1 functions as a scaffold for spindle assembly checkpoint proteins during mitosis.
DIAPH3 is involved in actin remodeling for moderating cell movement and adhesion.
MKI67 is a marker of cell proliferation.
NDST3 is involved in heparan sulfate metabolism.
CDK6 regulates the G1/S transition of the cell cycle.
TFDP2 is a transcription factor that dimerizes with E2F for cell cycle progression.
ENSCAFG00000041010 is a TCR gamma C region gene.
ENSCAFG00000011098 is a TCR delta C region gene.
BCL2 is an anti*apoptotic gene.
The graph_test()
function uses a statistic from spatial
autocorrelation analysis (Moran’s I) to find genes that vary
between groups of cells in UMAP space. A Moran’s I value of 0
indicates no effect, while +1 indicates perfect positive autocorrelation
and suggests that nearby cells have very similar values of a gene’s
expression. Significant values much less than zero are generally rare.
Positive values indicate a gene is expressed in a focal region of the
UMAP space (i.e., specific to one or more clusters).
deg <- graph_test(cds, neighbor_graph = "principal_graph") # the data frame 'deg' will contain the Moran's I test results for each gene in cds.
deg_ids <- row.names(subset(deg, q_value < 0.05))
write.csv(deg, file="monocle3_pseudotimeGeneExpressionAnalysis.csv")
Plot markers of interest that varied the most across pseudotime or were one of the top markers for a significant stage of thymocyte and T-cell development.
cds <- cluster_cells(cds, reduction_method = "UMAP")
cds <- learn_graph(cds, use_partition = F)
cds <- order_cells(cds, reduction_method = "UMAP", root_cells = colnames(cds[, clusters(cds) == 10]))
genes <- c("PAK3", "FBN2", "CDH1", "DPP4", "TBC1D9", "MRC1", "BAHCC1", "ALPL", "PTPRF", "CHN2", "NEDD4L", "CD1C", "ADAM12", "DHDDS", "DNTT", "ARPP21", "MAML3", "FSTL4", "INPP4B", "SCML4", "TOX2", "BCL2", "RIPOR2", "TGFBR2", "FAU", "RPL13", "RPL19", "RPS11", "RPS15A", "SELL", "LTB", "SCML4", "DLA-DRA", "TCF12", "RAG1", "FLT3", "CCR9", "CD44", "GATA3", "NOTCH1", "NOTCH3", "SATB1")
for (gene in genes){
p <- plot_cells(cds,
genes=gene,
label_cell_groups = FALSE,
show_trajectory_graph = FALSE) &
scale_color_gradientn(colours = rev(brewer.pal(name="Spectral", n=11)))
print(p)
}
for (gene in genes){
cds_subset <- cds[rowData(cds)$gene_short_name == gene]
cds_subset <- order_cells(cds_subset, reduction_method = "UMAP", root_cells = colnames(cds[, clusters(cds) == 10]))
p <- plot_genes_in_pseudotime(cds_subset, color_cells_by = "assigned_cell_type") + theme(text = element_text(size=20)) + guides(color = guide_legend(override.aes = list(size=5)))
print(p)
}
plot_genes_by_group(cds,
genes,
group_cells_by = "assigned_cell_type",
ordering_type = "cluster_row_col",
max.size = 5)
find_gene_modules()
runs UMAP on genes (as opposed to
cells) to group them based on their Moran’s I value into
modules using Louvain community analysis.
This is required for find_gene_modules to work; see https://github.com/cole-trapnell-lab/monocle3/issues/623 and https://github.com/cole-trapnell-lab/monocle3/issues/655.
cds <- preprocess_cds(cds)
cds <- reduce_dimension(cds)
cds <- cluster_cells(cds, reduction_method = "UMAP", resolution = 1e-4)
p1 <- plot_cells(cds,
color_cells_by = "assigned_cell_type",
label_groups_by_cluster = T,
label_branch_points = T,
show_trajectory_graph = F,
group_label_size = 5) + ggtitle("Assigned Cell Types from Seurat Clusters") + theme(legend.position = "right")
p2 <- plot_cells(cds,
group_label_size = 5,
show_trajectory_graph = FALSE) + ggtitle("Monocle3 Cluster IDs")
p1+p2
Visualize which gene modules are expressed in which clusters.
gene_module_df <- find_gene_modules(cds[deg_ids, ], resolution=1e-2) # creates data frame that contains a row for each gene and identifies the module it belongs to
write.csv(gene_module_df, file = "gene_module_df.csv")
kable(head(gene_module_df, n = 20))
id | module | supermodule | dim_1 | dim_2 |
---|---|---|---|---|
ZNF157 | 44 | 1 | 0.4639565 | 0.5635854 |
EMD | 25 | 1 | -3.4970799 | -1.2860831 |
SMC1A | 41 | 1 | 1.2785526 | -3.4609367 |
CENPI | 41 | 1 | 1.2040056 | -4.2394809 |
AFF2 | 74 | 1 | 0.0797271 | 2.9250151 |
ZNF41 | 18 | 1 | -1.6303992 | 1.7328247 |
FLNA | 27 | 1 | -3.5133209 | 0.6423115 |
MORC4 | 36 | 1 | 4.3998816 | -0.2569925 |
ENSCAFG00000013554 | 64 | 1 | -5.1286199 | -2.2687923 |
ARAF | 40 | 1 | 2.8898923 | 0.6751892 |
RBM41 | 2 | 1 | 4.6193874 | 1.1333959 |
TIMM8A | 33 | 1 | -1.8862363 | -0.9665346 |
SYN1 | 71 | 2 | -5.8734577 | 2.9314056 |
NUP62CL | 41 | 1 | 1.4067100 | -4.0201897 |
TIMP1 | 19 | 1 | -3.0838256 | 4.0108839 |
BTK | 62 | 2 | -6.1455238 | 4.2697926 |
ENSCAFG00000045333 | 29 | 1 | -4.5652788 | -1.9048082 |
PRPS1 | 56 | 1 | -1.0661348 | -3.4851185 |
IDS | 19 | 1 | -3.3845995 | 3.5496195 |
ELK1 | 60 | 1 | 0.6169909 | 1.7827289 |
# table aggregating expression of all genes in each module across all Monocle3 clusters
m3_cell_group_df <- tibble::tibble(
cell=row.names(colData(cds)),
cell_group=clusters(cds)[colnames(cds)])
agg_mat_m3 <- aggregate_gene_expression(cds, gene_module_df, m3_cell_group_df)
row.names(agg_mat_m3) <- stringr::str_c("Module ", row.names(agg_mat_m3))
colnames(agg_mat_m3) <- stringr::str_c("Cluster ", colnames(agg_mat_m3))
write.csv(agg_mat_m3, file = "GeneModuleAssignment_Monocle3Clusters.csv")
# table aggregating expression of all genes in each module across all cell type assignments from our Seurat object
ann_cell_group_df <- tibble::tibble(
cell = colnames(cds),
cell_group = colData(cds)$assigned_cell_type
)
agg_mat_ann <- aggregate_gene_expression(cds, gene_module_df, ann_cell_group_df)
row.names(agg_mat_ann) <- stringr::str_c("Module ", row.names(agg_mat_ann))
colnames(agg_mat_ann) <- stringr::str_c(unique(ann_cell_group_df$cell_group))
write.csv(agg_mat_ann, file = "GeneModuleAssignment_SeuratAnnotations.csv")
# heatmaps
pheatmap::pheatmap(agg_mat_m3,
cluster_rows = TRUE,
cluster_cols = TRUE,
scale = "column",
clustering_method = "ward.D2",
fontsize = 10,
main = "Expression of Gene Modules Across Monocle3 Clusters")
pheatmap::pheatmap(agg_mat_ann,
cluster_rows = TRUE,
cluster_cols = TRUE,
scale = "column",
clustering_method = "ward.D2",
fontsize = 10,
main = "Expression of Gene Modules Across Assigned Cell Types")
sessionInfo()
## R version 4.4.0 (2024-04-24 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/Denver
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] RColorBrewer_1.1-3 pheatmap_1.0.12
## [3] SeuratWrappers_0.4.0 monocle3_1.3.7
## [5] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
## [7] GenomicRanges_1.56.2 GenomeInfoDb_1.40.1
## [9] IRanges_2.38.1 S4Vectors_0.42.1
## [11] MatrixGenerics_1.16.0 matrixStats_1.5.0
## [13] Biobase_2.64.0 BiocGenerics_0.50.0
## [15] patchwork_1.3.0 lubridate_1.9.4
## [17] forcats_1.0.0 stringr_1.5.1
## [19] dplyr_1.1.4 purrr_1.0.2
## [21] readr_2.1.5 tidyr_1.3.1
## [23] tibble_3.2.1 ggplot2_3.5.1
## [25] tidyverse_2.0.0 knitr_1.49
## [27] Seurat_5.2.1 SeuratObject_5.0.2
## [29] sp_2.1-4
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 splines_4.4.0
## [3] later_1.4.1 R.oo_1.27.0
## [5] polyclip_1.10-7 fastDummies_1.7.5
## [7] lifecycle_1.0.4 sf_1.0-19
## [9] Rdpack_2.6.2 pbmcapply_1.5.1
## [11] globals_0.16.3 lattice_0.22-6
## [13] MASS_7.3-60.2 magrittr_2.0.3
## [15] speedglm_0.3-5 plotly_4.10.4
## [17] sass_0.4.9 rmarkdown_2.29
## [19] jquerylib_0.1.4 yaml_2.3.10
## [21] remotes_2.5.0 httpuv_1.6.15
## [23] sctransform_0.4.1 spam_2.11-1
## [25] spatstat.sparse_3.1-0 reticulate_1.40.0
## [27] DBI_1.2.3 cowplot_1.1.3
## [29] pbapply_1.7-2 minqa_1.2.8
## [31] abind_1.4-8 zlibbioc_1.50.0
## [33] Rtsne_0.17 R.utils_2.12.3
## [35] GenomeInfoDbData_1.2.12 ggrepel_0.9.6
## [37] irlba_2.3.5.1 listenv_0.9.1
## [39] spatstat.utils_3.1-2 units_0.8-5
## [41] goftest_1.2-3 RSpectra_0.16-2
## [43] spatstat.random_3.3-2 fitdistrplus_1.2-2
## [45] parallelly_1.42.0 DelayedMatrixStats_1.26.0
## [47] codetools_0.2-20 DelayedArray_0.30.1
## [49] tidyselect_1.2.1 UCSC.utils_1.0.0
## [51] farver_2.1.2 viridis_0.6.5
## [53] lme4_1.1-36 spatstat.explore_3.3-4
## [55] jsonlite_1.8.9 e1071_1.7-16
## [57] progressr_0.15.1 ggridges_0.5.6
## [59] survival_3.5-8 tools_4.4.0
## [61] ica_1.0-3 Rcpp_1.0.14
## [63] glue_1.8.0 gridExtra_2.3
## [65] SparseArray_1.4.8 xfun_0.49
## [67] withr_3.0.2 BiocManager_1.30.25
## [69] fastmap_1.2.0 boot_1.3-30
## [71] spData_2.3.4 digest_0.6.35
## [73] rsvd_1.0.5 timechange_0.3.0
## [75] R6_2.5.1 mime_0.12
## [77] wk_0.9.4 colorspace_2.1-1
## [79] scattermore_1.2 tensor_1.5
## [81] spatstat.data_3.1-4 R.methodsS3_1.8.2
## [83] RhpcBLASctl_0.23-42 generics_0.1.3
## [85] data.table_1.16.4 class_7.3-22
## [87] httr_1.4.7 htmlwidgets_1.6.4
## [89] S4Arrays_1.4.1 spdep_1.3-10
## [91] uwot_0.2.2 pkgconfig_2.0.3
## [93] gtable_0.3.6 lmtest_0.9-40
## [95] XVector_0.44.0 htmltools_0.5.8.1
## [97] dotCall64_1.2 scales_1.3.0
## [99] png_0.1-8 spatstat.univar_3.1-1
## [101] reformulas_0.4.0 rstudioapi_0.17.1
## [103] tzdb_0.4.0 reshape2_1.4.4
## [105] nlme_3.1-164 nloptr_2.1.1
## [107] proxy_0.4-27 cachem_1.1.0
## [109] zoo_1.8-12 KernSmooth_2.23-22
## [111] parallel_4.4.0 miniUI_0.1.1.1
## [113] s2_1.1.7 pillar_1.10.1
## [115] grid_4.4.0 vctrs_0.6.5
## [117] RANN_2.6.2 slam_0.1-55
## [119] promises_1.3.2 xtable_1.8-4
## [121] cluster_2.1.6 evaluate_1.0.3
## [123] cli_3.6.2 compiler_4.4.0
## [125] rlang_1.1.3 crayon_1.5.3
## [127] grr_0.9.5 leidenbase_0.1.31
## [129] future.apply_1.11.3 labeling_0.4.3
## [131] classInt_0.4-11 plyr_1.8.9
## [133] stringi_1.8.4 viridisLite_0.4.2
## [135] deldir_2.0-4 assertthat_0.2.1
## [137] munsell_0.5.1 lazyeval_0.2.2
## [139] spatstat.geom_3.3-5 Matrix_1.7-0
## [141] RcppHNSW_0.6.0 hms_1.1.3
## [143] sparseMatrixStats_1.16.0 future_1.34.0
## [145] shiny_1.10.0 rbibutils_2.3
## [147] ROCR_1.0-11 igraph_2.1.4
## [149] bslib_0.8.0 biglm_0.9-3
citation()
## To cite R in publications use:
##
## R Core Team (2024). _R: A Language and Environment for Statistical
## Computing_. R Foundation for Statistical Computing, Vienna, Austria.
## <https://www.R-project.org/>.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {R: A Language and Environment for Statistical Computing},
## author = {{R Core Team}},
## organization = {R Foundation for Statistical Computing},
## address = {Vienna, Austria},
## year = {2024},
## url = {https://www.R-project.org/},
## }
##
## We have invested a lot of time and effort in creating R, please cite it
## when using it for data analysis. See also 'citation("pkgname")' for
## citing R packages.