1 Introduction

The aim of the SeuratPipe package is to remove technical details from the user, enabling wet lab researchers to have an ‘initial look’ at the data themselves.

In this vignette we will show how to perform different Seurat analyses using SeuratPipe. To run the contents of this vignette you should download the data from this Github repository https://github.com/andreaskapou/SeuratPipe_tutorial

2 SeuratPipe installation

# install.packages("remotes")

# To install the stable release
remotes::install_github("andreaskapou/SeuratPipe@release/v1.0.0")

# To install the latest source version (not recommended)
remotes::install_github("andreaskapou/SeuratPipe")

2.1 Scrublet Python installation

Scrublet is a method for doublet detection, Github page: https://github.com/swolock/scrublet.

The method is implemented in Python, hence when installing SeuratPipe, the Scrublet package cannot be installed automatically. We can try to install and run Scrublet within R using the reticulate package. I wrote a helper function to ease installation of Scrublet. In R type:

SeuratPipe::install_scrublet()
# type ?install_scrublet for more details

If this fails, you need to follow details in reticulate package on how to install any Python package on your machine.

3 Datasets

Initially, using the PBMC3K dataset, we will show how to read 10x output files and create Seurat objects, perform QC filtering and subsequent processing and clustering steps.

Next, we will use the IFNB dataset, which contains two samples (control and stimulated cells) to perform data integration with Harmony.

First we load required packages

library(SeuratPipe)
library(Seurat)

4 Running QC pipeline

SeuratPipe assumes the user will have a sample metadata file/data frame which contains all the metadata for each scRNA-seq sample. It also requires the following column names to be present: sample, donor, path, condition and pass_qc (logical TRUE or FALSE).

For instance the below metadata file looks like this:

sample	donor	path	condition	pass_qc	technology
pbmc1	pbmc1	pbmc3k	Healthy	TRUE	whole-cell

Settings for QC

# Assume 10x data are in 'data' and we have metadata file 'meta_pbmc3k.csv'
io <- list()
io$data_dir <- "data/"
io$out_dir <- "pbmc_results/qc/"
io$tenx_dir <- "/outs/" # Assume whole-cell
io$meta_file <- paste0(io$data_dir, "/meta_pbmc3k.csv")

# Opts
opts <- list()
opts$nfeat_thresh <- 200
opts$mito_thresh <- 5
opts$meta_colnames <- c("donor", "condition", "pass_qc")
opts$qc_to_plot <- c("nFeature_RNA", "nCount_RNA", "percent.mito")

We can perform all QC pipeline steps using a single wrapper function run_qc_pipeline as shown below. This function will also generate all the QC plots.

NOTE: The wrapper will also store an .rds file, named (seu_qc.rds) which contains the Seurat object and the opts we defined for our analysis. This is done for reproducibility purposes (see Cluster analysis section how to read this file).

seu <- run_qc_pipeline(
  data_dir = io$data_dir, 
  sample_meta = NULL, 
  sample_meta_filename = io$meta_file,
  nfeat_thresh = opts$nfeat_thresh, 
  mito_thresh = opts$mito_thresh,
  meta_colnames = opts$meta_colnames,
  out_dir = io$out_dir,
  qc_to_plot = opts$qc_to_plot,
  use_scrublet = FALSE, 
  use_soupx = FALSE,
  tenx_dir = io$tenx_dir)

4.1 Advanced

If you want more control over qc analysis, the above function wraps the following functions and generates plots.

# First create Seurat
seu <- create_seurat_object(data_dir, sample_meta, ...)
# Then perform filtering
seu <- filter_seurat_object(seu, ...)

5 Running cluster pipeline

It is advised to have different files for different analysis tasks, here QC and clustering. Because, most likely you will perform QC once (or two), but clustering (and other downstream tasks) will be performed multiple times. This is the reason we store the QCed Seurat object so we can read it directly from the disk.

That is why here I am defining again the settings for the clustering pipeline

# I/O
io <- list()
io$base_dir <- "pbmc_results/"
io$out_dir <- paste0(io$base_dir, "/cluster/")

# Read Seurat object stored after QC
seu_obj <- readRDS(file = paste0(io$base_dir, "/qc/seu_qc.rds"))
seu <- seu_obj$seu   # Extract Seurat object
opts <- seu_obj$opts # Extract opts from QC (helpful for reproducibility)

##
# Clustering opts
##
# Number of PCs, can be a vector: c(30, 50)
opts$npcs <- c(50)
# Top PCs to perform UMAP and clustering, can be a vector: c(30, 40)
opts$ndims <- c(40)
# Clustering resolutions
opts$res <- seq(from = 0.2, to = 0.2, by = 0.1)
# QC information to plot
opts$qc_to_plot <- c("nFeature_RNA", "percent.mito")
# Metadata columns to plot
opts$meta_to_plot <- c("condition", "sample")
# Maximum cutoff values for plotting continuous features, e.g. gene expression
# Gives better plots where colour scale is not driven by a (few) outlier cells.
# Set to NULL to recoved default Seurat plots
opts$max.cutoff <- "q98"

In SeuratPipe we can also define module groups, consisting of known marker genes, that can help in potentially annotating cell populations.

# Define modules for lineage and cell cycle
lineage <- list(lin_CD4 = c("IL7R", "CCR7"),
                lin_Monocyte = c("CD14", "LYZ", "FCGR3A"),
                lin_NK = c("GNLY", "NKG7"))
# Take a random subset from Seurat's cycle genes
cell_cycle <- list(cc_s = c("PCNA", "RRM1", "POLA1", "USP1"),
                   cc_g2m = c("HMGB2", "MKI67", "TMPO", "CTCF"))

# Group all modules in named list to pass to SeuratPipe functions
opts$modules_group <- list(lineage = lineage,
                           cell_cycle = cell_cycle)

Here we show how to perform the most common analysis pipeline in Seurat for cell type identification. The wrapper function performs the following steps:

Data processing, e.g. Log normalisation and scaling.
Running PCA.
Perform UMAP and clustering.
Generate plots / outputs in corresponding directories.

NOTE: If multiple npcs and ndims are given as input, the updated Seurat object with the last setting will be returned.

# Run pipeline, note you can define additional parameters
# (e.g. top N genes to plot per cluster), which are left 
# to default values. type ?run_cluster_pipeline for help.
seu <- run_cluster_pipeline(
  seu_obj = seu, 
  out_dir = io$out_dir, 
  npcs = opts$npcs,
  ndims = opts$ndims, 
  res = opts$res,
  modules_group = opts$modules_group,
  metadata_to_plot = opts$meta_to_plot,
  qc_to_plot = opts$qc_to_plot, 
  max.cutoff = opts$max.cutoff,
  pcs_to_remove = NULL, seed = 1)

## Res 0.2 
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2642
## Number of edges: 162022
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9304
## Number of communities: 5
## Elapsed time: 0 seconds

5.1 Advanced

If you want more control over the clustering pipeline, the above function wraps the following functions. The code below works for a single choice of npcs and ndims whereas the wrapper function allows multiple values to be passed as a vector.

# Process and PCA
seu <- lognormalize_and_pca(seu, npcs = 50, ...)
# UMAP
seu <- run_umap(seu, dims = 1:opts$npcs, reduction="pca", ...)
# QC plots on dimensional reduced space
dimred_qc_plots(seu, reductions = c("umap", "pca"), ...)
# Module score analysis
seu <- module_score_analysis(seu, modules_group, ...)
# Clustering
seu <- cluster_analysis(seu, dims = 1:30, res = opts$res, ...)

Users can have even more control over cluster_analysis, and the main functions to perform clustering are the following:

# Find neighbors
seu <- find_neighbors(seu, dims = 1:opts$npcs, reduction = "pca", ...)
# Clustering with specific resolution
seu <- find_clusters(seu, resolution = 0.3, ...)
# Identify marker genes
markers <- find_all_markers(seu, logfc.threshold = 0.5, 
                            min.pct = 0.25, only.pos = TRUE)

Note that the naming convention is to use exactly the same function names as Seurat package, but using underscores instead of CamelCase, which is the default in Seurat. The same holds for the plotting functions (see examples in next section).

In case we had a list of Seurat objects and wanted to perform clustering analysis for each sample independently, the code would look like this:

for (s in names(seu)) {
  sample_dir <- paste0(io$out_dir, "/", s, "/")
  seu[[s]] <- run_cluster_pipeline(seu_obj = seu[[s]], 
                                   out_dir = sample_dir, ...)
}

5.2 Plotting / Visualisation

SeuratPipe also provides tailored functions to generate high quality plots, by extending Seurat’s built in plotting functions.

# UMAP plot
dim_plot(seu, reduction = "umap", group.by = "seurat_clusters")

# UMAP plot using SeuratPipe colour palette
dim_plot(seu, reduction = "umap", group.by = "seurat_clusters", 
         col_pal = SeuratPipe:::discrete_col_pal)

# Plotting specific genes
feature_plot(seu, features = c("LYZ", "NKG7"), 
             ncol = 2, legend.position = "top")

# Plotting lineage module scores as feature plot
feature_plot(seu, features = names(lineage), ncol = 3) & NoAxes()

# Plotting lineage module scores as dot plot
dot_plot(seu, features = names(lineage))

We can also check if specific technical factors are correlated with specific principal components.

pca_feature_cor_plot(seu, features = c("nFeature_RNA", "percent.mito"))

The above plot can also be useful to understand which specific principal components explain/capture information about different cell types/lineages.

# Correlation of lineage scores with PCs
pca_feature_cor_plot(seu, features = names(lineage))

6 Harmony integration

To showcase the pipeline for Harmony integration we will first load a pre-processed Seurat immune dataset called ifnb. The object is in the same format as if we were to run QC with the code in run_qc_pipeline.

Reading Seurat object and defining settings for Harmony pipeline. Note that parameters are almost identical to run_cluster_pipeline, with minor differences, such as the run_harmony_pipeline can accept a list of Seurat objects (i.e. multiple samples), and internally processed and merges the data. If you already have merged and processed your data, the pipeline accepts a single Seurat object and will skip the merging step.

# I/O
io <- list()
io$base_dir <- "ifnb_results/"
io$out_dir <- paste0(io$base_dir, "/integrate/")

# Read QCed Seurat object
seu_obj <- readRDS(file = paste0(io$base_dir, "/qc/seu_qc.rds"))
seu <- seu_obj$seu   # Extract (list of) Seurat objects
opts <- seu_obj$opts # Extract opts from QC (here empty list)

# Remove object and return memory to the system
rm(seu_obj)
gc(verbose = FALSE)

##            used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
## Ncells  3149299 168.2    5190695 277.3         NA  5190695 277.3
## Vcells 50928012 388.6   68900900 525.7      32768 57350444 437.6

##
# Clustering opts
##
# Number of PCs, can be a vector: c(30, 50)
opts$npcs <- c(50)
# Top Harmony dims for UMAP and clustering, can be a vector: c(30, 40)
opts$ndims <- c(40)
# Clustering resolutions
opts$res <- seq(0.1, 0.1, by = 0.1)
# Batch ID to perform integration
opts$batch_id <- "sample"
# QC information to plot
opts$qc_to_plot <- c("nFeature_RNA", "percent.mito")
# Metadata columns to plot
opts$meta_to_plot <- c("condition")
# Maximum number of Harmony iterations
opts$max.iter.harmony <- 50
# If intermediate object 'seu_harmony_<>.rds' exists (e.g. from 
# earlier analysis) and force_reanalysis = FALSE, read object instead 
# of re-running Harmony integration. Added for computing time efficiency.
opts$force_reanalysis <- TRUE
# Maximum cutoff values for plotting continuous features, e.g. gene expression
# Gives better plots where colour scale is not driven by a (few) outlier cells.
# Set to NULL to recoved default Seurat plots
opts$max.cutoff <- "q98"

Next we are ready to perform the full Harmony integration pipeline and obtain our final clusters.

# Run pipeline, note you can define additional parameters
# (e.g. top N genes to plot per cluster), which are left 
# to default values. type ?run_harmony_pipeline for help.
seu <- run_harmony_pipeline(
  seu_obj = seu, 
  batch_id = opts$batch_id,
  out_dir = io$out_dir, 
  npcs = opts$npcs,
  ndims = opts$ndims, 
  res = opts$res,
  modules_group = NULL,
  metadata_to_plot = opts$meta_to_plot,
  qc_to_plot = opts$qc_to_plot, 
  max.cutoff = opts$max.cutoff,
  force_reanalysis = opts$force_reanalysis,
  max.iter.harmony = opts$max.iter.harmony,
  pcs_to_remove = NULL, seed = 1)

## Res 0.1 
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 13999
## Number of edges: 548086
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9674
## Number of communities: 9
## Elapsed time: 1 seconds

6.1 Advanced

If you want more control over the Harmony integration pipeline, the above function wraps the following functions. The code below works for a single choice of npcs and ndims whereas the wrapper function allows multiple values to be passed as a vector.

# Harmony analysis
seu <- harmony_analysis(seu, npcs = 50, ...)
# UMAP
seu <- run_umap(seu, dims = 1:opts$npcs, reduction="pca", ...)
# QC plots on dimensional reduced space
dimred_qc_plots(seu, reductions = c("umap", "pca"), ...)
# Module score analysis
seu <- module_score_analysis(seu, modules_group, ...)
# Clustering
seu <- cluster_analysis(seu, dims = 1:30, res = opts$res, ...)

6.2 Visualistion

dim_plot(seu, group.by = "seurat_clusters",
         col_pal = SeuratPipe:::discrete_col_pal)

dim_plot(seu, group.by = "condition", split.by = "condition", ncol = 2)

# Plotting specific genes
feature_plot(seu, features = c("LYZ", "NKG7"), 
             ncol = 2, legend.position = "top")

7 Session Info

This vignette was compiled using the following packages:

sessionInfo()

## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] SeuratObject_4.0.4 Seurat_4.1.0       SeuratPipe_1.0.0   BiocStyle_2.20.2  
## 
## loaded via a namespace (and not attached):
##   [1] Rtsne_0.15            colorspace_2.0-3      deldir_1.0-6         
##   [4] ellipsis_0.3.2        ggridges_0.5.3        rstudioapi_0.13      
##   [7] spatstat.data_2.1-2   farver_2.1.0          leiden_0.3.9         
##  [10] listenv_0.8.0         ggrepel_0.9.1         RSpectra_0.16-0      
##  [13] fansi_1.0.2           codetools_0.2-18      splines_4.1.2        
##  [16] knitr_1.37            polyclip_1.10-0       jsonlite_1.8.0       
##  [19] ica_1.0-2             cluster_2.1.2         png_0.1-7            
##  [22] pheatmap_1.0.12       uwot_0.1.11           shiny_1.7.1          
##  [25] sctransform_0.3.3     spatstat.sparse_2.1-0 BiocManager_1.30.16  
##  [28] compiler_4.1.2        httr_1.4.2            assertthat_0.2.1     
##  [31] Matrix_1.4-0          fastmap_1.1.0         lazyeval_0.2.2       
##  [34] limma_3.48.3          cli_3.2.0             later_1.3.0          
##  [37] htmltools_0.5.2       tools_4.1.2           igraph_1.2.11        
##  [40] gtable_0.3.0          glue_1.6.2            RANN_2.6.1           
##  [43] reshape2_1.4.4        dplyr_1.0.8           Rcpp_1.0.8.3         
##  [46] scattermore_0.8       jquerylib_0.1.4       vctrs_0.3.8          
##  [49] nlme_3.1-155          lmtest_0.9-40         spatstat.random_2.1-0
##  [52] xfun_0.30             stringr_1.4.0         globals_0.14.0       
##  [55] mime_0.12             miniUI_0.1.1.1        lifecycle_1.0.1      
##  [58] irlba_2.3.5           gtools_3.9.2          goftest_1.2-3        
##  [61] future_1.24.0         MASS_7.3-55           zoo_1.8-9            
##  [64] scales_1.1.1          spatstat.core_2.4-0   promises_1.2.0.1     
##  [67] spatstat.utils_2.3-0  parallel_4.1.2        RColorBrewer_1.1-2   
##  [70] yaml_2.3.5            reticulate_1.24       pbapply_1.5-0        
##  [73] gridExtra_2.3         ggplot2_3.3.5         sass_0.4.0           
##  [76] rpart_4.1.16          stringi_1.7.6         highr_0.9            
##  [79] harmony_0.1.0         rlang_1.0.2           pkgconfig_2.0.3      
##  [82] matrixStats_0.61.0    evaluate_0.15         lattice_0.20-45      
##  [85] ROCR_1.0-11           purrr_0.3.4           tensor_1.5           
##  [88] labeling_0.4.2        patchwork_1.1.1       htmlwidgets_1.5.4    
##  [91] cowplot_1.1.1         tidyselect_1.1.2      parallelly_1.30.0    
##  [94] RcppAnnoy_0.0.19      plyr_1.8.6            magrittr_2.0.2       
##  [97] bookdown_0.25         R6_2.5.1              magick_2.7.3         
## [100] generics_0.1.2        DBI_1.1.2             withr_2.5.0          
## [103] mgcv_1.8-39           pillar_1.7.0          fitdistrplus_1.1-8   
## [106] survival_3.3-1        abind_1.4-5           tibble_3.1.6         
## [109] future.apply_1.8.1    crayon_1.5.0          KernSmooth_2.23-20   
## [112] utf8_1.2.2            spatstat.geom_2.3-2   plotly_4.10.0        
## [115] rmarkdown_2.13        viridis_0.6.2         grid_4.1.2           
## [118] data.table_1.14.2     digest_0.6.29         xtable_1.8-4         
## [121] tidyr_1.2.0           httpuv_1.6.5          munsell_0.5.0        
## [124] viridisLite_0.4.0     bslib_0.3.1

8 Acknowledgments

I would like to thank John Wilson-Kanamori, Jordan Portman and Nick Younger. The SeuratPipe package has adapted many of the analyses pipelines that were developed by the computational biology group in Neil Henderson’s lab at the University of Edinburgh.

This package was supported by funding from the University of Edinburgh and Medical Research Council (core grant to the MRC Institute of Genetics and Cancer) for the Cross-Disciplinary Fellowship (XDF) programme.

SeuratPipe - Streamlining Seurat analysis

2022-04-12

Package