Introduction

The purpose of this script is to integrate scRNA-seq data from all samples arising from the same location (thymus, lymph node), which can help match shared cell types and states across datasets, boost statistical power, and facilitate accurate comparative analysis across datasets. The integration procedure aims to return a single dimensional reduction that captures the shared sources of variance across multiple layers, so that cells in a similar biological state will cluster. The method returns a dimensional reduction which can be used for visualization and unsupervised clustering analysis.

Loading data and packages

Load necessary software packages

library(Seurat)
library(tidyverse)
library(patchwork)
library(pheatmap)
library(RColorBrewer)
library(data.table)
library(knitr)
library(DoubletFinder)
library(presto)

Set working directory

setwd("C:/Users/edlarsen/Documents/240828_scRNAseq")

Load filtered Seurat objects

Input: A filtered Seurat object with doublets removed.

filtered_ln <- readRDS(file = "C:/Users/edlarsen/Documents/240828_scRNAseq/Integration/seurat_merged_LN_filtered.RData")
filtered_thym <- readRDS(file = "C:/Users/edlarsen/Documents/240828_scRNAseq/Integration/seurat_merged_THYM_filtered.RData")

Split layers prior to integration

filtered_ln[["RNA"]] <- split(filtered_ln[["RNA"]], f = filtered_ln$orig.ident) # lymph node 
filtered_thym[["RNA"]] <- split(filtered_thym[["RNA"]], f = filtered_thym$orig.ident) # thymus

Analyze data prior to integration

These parameters were selected based on recommendations in the Seurat documentation for performing integration: https://satijalab.org/seurat/articles/integration_introduction

# run standard anlaysis workflow
## Lymph  node
integ_ln <- NormalizeData(filtered_ln)
integ_ln <- FindVariableFeatures(integ_ln)
integ_ln <- ScaleData(integ_ln)
integ_ln <- RunPCA(integ_ln)

# visualize
integ_ln <- FindNeighbors(integ_ln, dims = 1:30, reduction = "pca")
integ_ln <- FindClusters(integ_ln, resolution = 2)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 29438
## Number of edges: 1037221
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8371
## Number of communities: 47
## Elapsed time: 3 seconds
integ_ln <- RunUMAP(integ_ln, dims = 1:30, reduction = "pca")
DimPlot(integ_ln, reduction = "umap", group.by = c("orig.ident", "seurat_clusters")) + # orig.ident is the sample ID
  plot_annotation(title = "Canine Lymph Node - Unintegrated", theme = theme(plot.title = element_text(hjust=0.5, size = 20)))

# run standard analysis workflow
## Thymus
integ_thym <- NormalizeData(filtered_thym)
integ_thym <- FindVariableFeatures(integ_thym)
integ_thym <- ScaleData(integ_thym)
integ_thym <- RunPCA(integ_thym)

# visualize
integ_thym <- FindNeighbors(integ_thym, dims = 1:30, reduction = "pca")
integ_thym <- FindClusters(integ_thym, resolution = 2)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 22654
## Number of edges: 847974
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8373
## Number of communities: 34
## Elapsed time: 2 seconds
integ_thym <- RunUMAP(integ_thym, dims = 1:30, reduction = "pca")
DimPlot(integ_thym, reduction = "umap", group.by = c("orig.ident", "seurat_clusters")) + # orig.ident is the sample ID
  plot_annotation(title = "Canine Thymus - Unintegrated", theme = theme(plot.title = element_text(hjust=0.5, size = 20)))

Perform integration

Canonical correlation analysis (CCA) identifies shared sources of variation between samples, but only identifies the greatest sources of variation in the data if it is shared or conserved across samples. CCA-based integration therefore enables integrative analysis when experimental conditions or disease states introduce very strong expression shifts, or when integrating datasets across modalities and species. However, CCA-based integration may also lead to overcorrection, especially when a large proportion of cells are non-overlapping across datasets.

RPCA-based integration runs faster, and represents a more conservative approach where cells in different biological states are less likely to ‘align’ after integration. RPCA is recommended when a substantial fraction of cells in one dataset have no matching type in the other, datasets originate from the same platform (i.e., multiple lanes of 10x genomics), or there are a large number of datasets or cells to integrate.

#### Returns error: Error in UseMethod(generic = "Assays", object = object) : no applicable method for 'Assays' applied to an object of class "NULL"
integ_ln <- IntegrateLayers(object = integ_ln,
                            method = CCAIntegration,
                            orig.reduction = "pca",
                            new.reduction = "integrated.cca",
                            verbose = FALSE)
integ_ln[["RNA"]] <- JoinLayers(integ_ln[["RNA"]]) # rejoin layers after integration
integ_thym <- IntegrateLayers(object = integ_thym,
                            method = CCAIntegration,
                            orig.reduction = "pca", 
                            new.reduction = "integrated.cca", 
                            verbose = FALSE)
integ_thym[["RNA"]] <- JoinLayers(integ_thym[["RNA"]]) # rejoin layers after integration

Visualize after integration

## Lymph node 
integ_ln <- FindNeighbors(integ_ln, reduction = "integrated.cca", dims = 1:30)
integ_ln <- FindClusters(integ_ln, resolution = 1)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 29438
## Number of edges: 1284334
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8905
## Number of communities: 32
## Elapsed time: 4 seconds
integ_ln <- RunUMAP(integ_ln, dims = 1:30, reduction = "integrated.cca")
DimPlot(integ_ln, reduction = "umap", group.by = c("orig.ident", "seurat_clusters")) +
  plot_annotation(title = "Canine Lymph Node - Integrated", theme = theme(plot.title = element_text(hjust=0.5, size = 20)))

## Thymus
integ_thym <- FindNeighbors(integ_thym, reduction = "integrated.cca", dims = 1:30)
integ_thym <- FindClusters(integ_thym, resolution = 1)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 22654
## Number of edges: 1004358
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8660
## Number of communities: 22
## Elapsed time: 2 seconds
integ_thym <- RunUMAP(integ_thym, dims = 1:30, reduction = "integrated.cca")
DimPlot(integ_thym, reduction = "umap", group.by = c("orig.ident", "seurat_clusters")) +
  plot_annotation(title = "Canine Thymus - Integrated", theme = theme(plot.title = element_text(hjust=0.5, size = 20)))

Save integrated Seurat objects

saveRDS(integ_ln, file="seurat_integrated_LN_filtered.RData")
saveRDS(integ_thym, file="seurat_integrated_THYM_filtered.RData")

Next step: Data normalization and clustering.

Citations

sessionInfo()
## R version 4.4.0 (2024-04-24 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/Denver
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] presto_1.0.0        Rcpp_1.0.13         DoubletFinder_2.0.4
##  [4] knitr_1.49          data.table_1.16.4   RColorBrewer_1.1-3 
##  [7] pheatmap_1.0.12     patchwork_1.3.0     lubridate_1.9.4    
## [10] forcats_1.0.0       stringr_1.5.1       dplyr_1.1.4        
## [13] purrr_1.0.2         readr_2.1.5         tidyr_1.3.1        
## [16] tibble_3.2.1        ggplot2_3.5.1       tidyverse_2.0.0    
## [19] Seurat_5.1.0        SeuratObject_5.0.2  sp_2.1-4           
## 
## loaded via a namespace (and not attached):
##   [1] rstudioapi_0.17.1      jsonlite_1.8.9         magrittr_2.0.3        
##   [4] spatstat.utils_3.1-1   farver_2.1.2           rmarkdown_2.29        
##   [7] vctrs_0.6.5            ROCR_1.0-11            spatstat.explore_3.3-3
##  [10] htmltools_0.5.8.1      sass_0.4.9             sctransform_0.4.1     
##  [13] parallelly_1.40.1      KernSmooth_2.23-22     bslib_0.8.0           
##  [16] htmlwidgets_1.6.4      ica_1.0-3              plyr_1.8.9            
##  [19] plotly_4.10.4          zoo_1.8-12             cachem_1.1.0          
##  [22] igraph_2.1.2           mime_0.12              lifecycle_1.0.4       
##  [25] pkgconfig_2.0.3        Matrix_1.7-0           R6_2.5.1              
##  [28] fastmap_1.2.0          fitdistrplus_1.2-2     future_1.34.0         
##  [31] shiny_1.10.0           digest_0.6.35          colorspace_2.1-1      
##  [34] tensor_1.5             RSpectra_0.16-2        irlba_2.3.5.1         
##  [37] labeling_0.4.3         progressr_0.15.1       spatstat.sparse_3.1-0 
##  [40] timechange_0.3.0       httr_1.4.7             polyclip_1.10-7       
##  [43] abind_1.4-8            compiler_4.4.0         withr_3.0.2           
##  [46] fastDummies_1.7.4      MASS_7.3-60.2          tools_4.4.0           
##  [49] lmtest_0.9-40          httpuv_1.6.15          future.apply_1.11.3   
##  [52] goftest_1.2-3          glue_1.7.0             nlme_3.1-164          
##  [55] promises_1.3.2         grid_4.4.0             Rtsne_0.17            
##  [58] cluster_2.1.6          reshape2_1.4.4         generics_0.1.3        
##  [61] gtable_0.3.6           spatstat.data_3.1-4    tzdb_0.4.0            
##  [64] hms_1.1.3              spatstat.geom_3.3-4    RcppAnnoy_0.0.22      
##  [67] ggrepel_0.9.6          RANN_2.6.2             pillar_1.10.1         
##  [70] spam_2.11-0            RcppHNSW_0.6.0         later_1.3.2           
##  [73] splines_4.4.0          lattice_0.22-6         survival_3.5-8        
##  [76] deldir_2.0-4           tidyselect_1.2.1       miniUI_0.1.1.1        
##  [79] pbapply_1.7-2          gridExtra_2.3          scattermore_1.2       
##  [82] xfun_0.49              matrixStats_1.4.1      stringi_1.8.4         
##  [85] lazyeval_0.2.2         yaml_2.3.10            evaluate_1.0.3        
##  [88] codetools_0.2-20       cli_3.6.2              uwot_0.2.2            
##  [91] xtable_1.8-4           reticulate_1.40.0      munsell_0.5.1         
##  [94] jquerylib_0.1.4        globals_0.16.3         spatstat.random_3.3-2 
##  [97] png_0.1-8              spatstat.univar_3.1-1  parallel_4.4.0        
## [100] dotCall64_1.2          listenv_0.9.1          viridisLite_0.4.2     
## [103] scales_1.3.0           ggridges_0.5.6         leiden_0.4.3.1        
## [106] rlang_1.1.3            cowplot_1.1.3
citation()
## To cite R in publications use:
## 
##   R Core Team (2024). _R: A Language and Environment for Statistical
##   Computing_. R Foundation for Statistical Computing, Vienna, Austria.
##   <https://www.R-project.org/>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {R: A Language and Environment for Statistical Computing},
##     author = {{R Core Team}},
##     organization = {R Foundation for Statistical Computing},
##     address = {Vienna, Austria},
##     year = {2024},
##     url = {https://www.R-project.org/},
##   }
## 
## We have invested a lot of time and effort in creating R, please cite it
## when using it for data analysis. See also 'citation("pkgname")' for
## citing R packages.