Introduction

The purpose of this script is to perform single-cell RNA-seq data normalization and clustering with Seurat. This script follows the standard pre-processing workflow as described in the the Seurat documentation https://satijalab.org/seurat/articles/pbmc3k_tutorial#normalizing-the-data.

Software

library(Seurat)
library(tidyverse)
library(patchwork)
library(pheatmap)
library(RColorBrewer)
library(knitr)
library(clustree)
library(presto)

Data

Input: Seurat objects that have been filtered to remove low quality cells +/- doublets, and multiple samples from the same location integrated into one object.

setwd("/scratch/alpine/edlarsen@colostate.edu/project_scrna_01/240828_scRNAseq/Normalization_and_Clustering")
integ_thym <- readRDS(file = "seurat_integrated_THYM_filtered.RData")

Data normalization

The global-scaling normalization method “LogNormalize” normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000), and log-transforms the result.

# thymus
normalized_thym <- NormalizeData(integ_thym,
                                 normalization.method = "LogNormalize",
                                 scale.factor = 10000)

Identification of highly variable features (feature selection)

Calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e., highly expressed in some cells and lowly expressed in others). Focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets. Seurat’s procedure involves directly modeling the mean-variance relationship inherent in single-cell data and returning 2,000 features per dataset by default that can be used in downstream analysis like PCA. Reference for Seurat’s procedure: https://www.sciencedirect.com/science/article/pii/S0092867419305598?via%3Dihub.

normalized_thym <- FindVariableFeatures(normalized_thym) # thymus

# Identify the 10 most highly variable genes
top10_thym <- head(VariableFeatures(normalized_thym), 10)

plot3 <- VariableFeaturePlot(normalized_thym)

plot4 <- LabelPoints(
  plot = plot3,
  points = top10_thym,
  repel = TRUE
) +
  labs(title = "Top 10 Most Variable Features:\n Canine Thymus")

plot4 

Scale data

Apply a linear transformation that is a standard pre-processing step prior to dimensional reduction techniques like PCA. This transformation works by shifting the expression of each gene, so that the mean expression across cells is 0. Then it scales the expression of each gene, so that the variance across cells is 1 (this step gives equal weight in downstream analyses, so that highly expressed genes do not dominate).

# By default, only variable features are scaled. To evaluate all genes and not just variable features: ScaleData(data, features = all.genes)
normalized_thym <- ScaleData(normalized_thym)

Perform linear dimensional reduction

Each PC of a PCA score essentially represents a ‘metafeature’ that combines information across a correlated feature set. The top principal components therefore represent a robust compression of the dataset.

# By default, the previously determined variable features are used as input, but can be thym_defined using the 'features' argument; if you want to do a custom set of features, make sure they are passed to ScaleData() first.

normalized_thym <- RunPCA(normalized_thym)

VizDimLoadings(normalized_thym, dims = 1:2, reduction = "pca") + ggtitle("PCA loadings: Canine Thymus") + theme(plot.title = element_text(hjust = -10, vjust = 3))

DimPlot(normalized_thym, reduction = "pca") + ggtitle("PCA: Canine Thymus")

DimHeatmap() allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analysis. Both cells and features are ordered according to their PCA scores. Setting ‘cells’ to a number plots the ‘extreme’ cells on both ends of the spectrum. Though a supervised analysis, this is a valuable tool for exploring correlated feature sets.

15 PC dimensions: Thymus

DimHeatmap(normalized_thym,
           dims = 1:15,
           cells = 500,
           balanced = TRUE)

Determine dataset dimensionality

To decide how many principal components to include, we rank principal components based on the percentage of variance explained by an “elbow” in the plot.

ElbowPlot(normalized_thym, ndims=40) + ggtitle("Elbow plot of principal components: Canine Thymus")

The location of the elbow in this plot suggests that the majority of the true signal is captured in the first 10 PCs. Based on this, the first 10 PCs will be used to generate cell clusters.

Cluster cells

Seurat first embeds cells in a K-nearest neighbor graph based on the euclidean distance in PCA space, and refines the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity). This draws edges between cells with similar feature expression patterns. To cluster the cells, modularity optimization is then applied to iteratively group cells together.

The resolution parameter determines how many clusters are created. Values above 1.0 generate a larger number of clusters, and values below 1.0 generate a smaller number of clusters. 0.4-1.2 typically returns good result for single-cell datasets of ~3k cells, but optimal resolution often increases for larger datasets. This script will initially set this parameter broadly and then evaluate the stability of the clusters drawn at each resolution using clustree.

normalized_thym <- FindNeighbors(normalized_thym, dims = 1:10) # replace with the PCs calculated above
normalized_thym <- FindClusters(normalized_thym, resolution = c(0.1,0.15,0.2,0.25,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0,1.1))
head(normalized_thym[[]])
normalized_thym[["RNA_snn_res.2"]] <- NULL # remove any additional resolutions not in the list above

clustree

We want to determine the optimal resolution and cluster number to achieve an appropriate balance between over-clustering and under-clustering in order to keep cell population identities as accurate and biologically informative as possible. Seurat has parameters for tuning the resolution and number of clusters, but this does not determine whether these generated clusters are meaningful.

Drawing a clustering tree can help compare the stability of clusterings at a range of resolutions. The node size is related to the number of cells in each cluster, and the node color indicates the clustering resolution. Arrows are colored according to the number of samples they represent, and the arrow transparency shows the incoming node proportion (the number of samples in the edge divided by the number of samples in the node it points to). As the tree becomes messier and nodes have multiple incoming edges, it indicates overclustering of the data. (Note that clustree does not do any clustering of its own, but rather checks the clustering done by Seurat.)

clustree2 <- clustree(normalized_thym, prefix = "RNA_snn_res.")
clustree2 + ggtitle("Canine Thymus Clusters") + theme(plot.title = element_text(hjust = 0.5))

Non-linear dimensional reduction (UMAP)

These algorithms learn the underlying structure in the dataset in order to place similar cells together in low-dimensional space. Cells that are grouped together within graph-based clusters determined above should co-localize on these dimension reduction plots. All visualization techniques have limitations and cannot fully represent the complexity of the underlying data.

FindClusters(normalized_thym, resolution = 0.2)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 22654
## Number of edges: 677273
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9488
## Number of communities: 11
## Elapsed time: 4 seconds
## An object of class Seurat 
## 14661 features across 22654 samples within 1 assay 
## Active assay: RNA (14661 features, 2000 variable features)
##  3 layers present: data, counts, scale.data
##  3 dimensional reductions calculated: pca, umap, integrated.cca
normalized_thym <- RunUMAP(normalized_thym, 
                           dims = 1:10, # replace with the PCs calculated above
                           n.neighbors = 50, # default is 30
                           min.dist = 0.5) # default is 0.3

DimPlot(normalized_thym,
        reduction = "umap",
                   label = TRUE,
                   label.size = 6) + 
  plot_annotation(title = "Canine Thymus, Resolution: 0.2, \nn.neighbors = 50, min.dist = 0.5", theme = theme(plot.title = element_text(hjust = 0.5, size = 20)))

Export data

Export Seurat objects once final clustering parameters are decided.

saveRDS(normalized_thym, file="THYM_NormalizedAndClustered.RData")

Citations

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] clustree_0.5.1     ggraph_2.2.1       knitr_1.49         RColorBrewer_1.1-3
##  [5] pheatmap_1.0.12    patchwork_1.3.0    lubridate_1.9.4    forcats_1.0.0     
##  [9] stringr_1.5.1      dplyr_1.1.4        purrr_1.0.2        readr_2.1.5       
## [13] tidyr_1.3.1        tibble_3.2.1       ggplot2_3.5.1      tidyverse_2.0.0   
## [17] Seurat_5.2.0       SeuratObject_5.0.2 sp_2.1-4          
## 
## loaded via a namespace (and not attached):
##   [1] rstudioapi_0.17.1      jsonlite_1.8.9         magrittr_2.0.3        
##   [4] spatstat.utils_3.1-1   farver_2.1.2           rmarkdown_2.29        
##   [7] vctrs_0.6.5            ROCR_1.0-11            memoise_2.0.1         
##  [10] spatstat.explore_3.3-3 htmltools_0.5.8.1      sass_0.4.9            
##  [13] sctransform_0.4.1      parallelly_1.40.1      KernSmooth_2.23-24    
##  [16] bslib_0.8.0            htmlwidgets_1.6.4      ica_1.0-3             
##  [19] plyr_1.8.9             plotly_4.10.4          zoo_1.8-12            
##  [22] cachem_1.1.0           igraph_2.1.2           mime_0.12             
##  [25] lifecycle_1.0.4        pkgconfig_2.0.3        Matrix_1.7-0          
##  [28] R6_2.5.1               fastmap_1.2.0          fitdistrplus_1.2-2    
##  [31] future_1.34.0          shiny_1.10.0           digest_0.6.37         
##  [34] colorspace_2.1-1       tensor_1.5             RSpectra_0.16-2       
##  [37] irlba_2.3.5.1          labeling_0.4.3         progressr_0.15.1      
##  [40] spatstat.sparse_3.1-0  timechange_0.3.0       httr_1.4.7            
##  [43] polyclip_1.10-7        abind_1.4-8            compiler_4.4.1        
##  [46] withr_3.0.2            viridis_0.6.5          fastDummies_1.7.5     
##  [49] ggforce_0.4.2          MASS_7.3-60.2          tools_4.4.1           
##  [52] lmtest_0.9-40          httpuv_1.6.15          future.apply_1.11.3   
##  [55] goftest_1.2-3          glue_1.8.0             nlme_3.1-164          
##  [58] promises_1.3.2         grid_4.4.1             Rtsne_0.17            
##  [61] cluster_2.1.6          reshape2_1.4.4         generics_0.1.3        
##  [64] gtable_0.3.6           spatstat.data_3.1-4    tzdb_0.4.0            
##  [67] data.table_1.16.4      hms_1.1.3              tidygraph_1.3.1       
##  [70] spatstat.geom_3.3-4    RcppAnnoy_0.0.22       ggrepel_0.9.6         
##  [73] RANN_2.6.2             pillar_1.10.1          spam_2.11-1           
##  [76] RcppHNSW_0.6.0         later_1.4.1            splines_4.4.1         
##  [79] tweenr_2.0.3           lattice_0.22-6         survival_3.6-4        
##  [82] deldir_2.0-4           tidyselect_1.2.1       miniUI_0.1.1.1        
##  [85] pbapply_1.7-2          gridExtra_2.3          scattermore_1.2       
##  [88] xfun_0.49              graphlayouts_1.2.1     matrixStats_1.5.0     
##  [91] stringi_1.8.4          lazyeval_0.2.2         yaml_2.3.10           
##  [94] evaluate_1.0.3         codetools_0.2-20       cli_3.6.3             
##  [97] uwot_0.2.2             xtable_1.8-4           reticulate_1.40.0     
## [100] munsell_0.5.1          jquerylib_0.1.4        Rcpp_1.0.13-1         
## [103] globals_0.16.3         spatstat.random_3.3-2  png_0.1-8             
## [106] spatstat.univar_3.1-1  parallel_4.4.1         dotCall64_1.2         
## [109] listenv_0.9.1          viridisLite_0.4.2      scales_1.3.0          
## [112] ggridges_0.5.6         rlang_1.1.5            cowplot_1.1.3
citation()
## To cite R in publications use:
## 
##   R Core Team (2024). _R: A Language and Environment for Statistical
##   Computing_. R Foundation for Statistical Computing, Vienna, Austria.
##   <https://www.R-project.org/>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {R: A Language and Environment for Statistical Computing},
##     author = {{R Core Team}},
##     organization = {R Foundation for Statistical Computing},
##     address = {Vienna, Austria},
##     year = {2024},
##     url = {https://www.R-project.org/},
##   }
## 
## We have invested a lot of time and effort in creating R, please cite it
## when using it for data analysis. See also 'citation("pkgname")' for
## citing R packages.