Single Cell RNA-Seq Analysis

Using R and Seurat

Sameet Mehta (msameet@gmail.com)

About Me

About Me

  • B.Sc. (Zoology, Micobiology, Chemistry), Pune University
  • M.Sc. (Molecular Biology), Pune University
  • Ph.D (Zoology), NCCS, Pune University
  • 4.5 years Postdoctoral Visiting Fellow NIH.
  • 6.5 years Research Facutly at Yale University
  • 1 year Scientist (Precision Oncology), SEMA4

Part I – Single-cell RNA Seq

Evolution

  • Micro-arrays
  • Highthroughput sequencing
    • Rise of the “omics”
  • single-cell “omics”

What is scRNA-Seq

  • single cell RNA-Sequencing

Important

  • This is a completely automated process.
  • Currently it just requires a good single-cell suspension.
  • Labelling, RNA-extraction, and library preparation happens inside a small device, and the final material can directly go on to the sequencer.

For every molecule sequenced there is associated information about

  • Which cell the molecule came from.
  • Which transcript the molecule came from.

Note

This is analysis of a sparse matrix with few thousand rows, and few thousand columns.

What is scRNA Seq

Requirements of scRNA Seq

Must Have

  • A good single-cell suspension.
  • 90%+ live cells, apoptotic cell populations can skew the analysis.
  • Odd shaped cells are a problem.
    • e.g. dendritic cells, nephrons, muscle

Single Nuclear Sequencing is a way around dissociation of tissue problems.

Why scRNA Seq

  • Bulk RNA-Seq averages out the signal.
  • You cannot do much with a “single” sample.
    • No replicates – no insights.
    • “How many?” – always a question. There are always trade-offs between monetary considerations and scientific rigor.
    • With more replicates you need to take care of the “batch” effects as well.
  • Each cell is potentially a “replicate”, so you have 100s of replicates.
  • Analysis at scale generates interesting biological insight.
  • Biology can be captured at scale.
  • There are constant advances being made in technology and analysis methology to allow even better insights.
  • There now more than just “RNA” seq possible at single-cell resolution.

Single Cell “Omics”

Current state-of-the-art

In addition to the scRNA Seq there is another big technology used widely

  • Single cell proteomics with technologies such as CyToF
    • Cellular Time-of-flight – The cells are labelled with heavy metal tagged antibodies (unlike fluorescent labelled antibodies used in FACS).
    • Allows upto 32 proteins to be quantified both surface markers and intra-cellular.
    • Can capture data for hundreds of thousands of cells at a time.
  • Single cell proteomics with spatial resolution with ImC (Imaging CyToF)
    • Same as CyToF, but on FFPE, or Fresh-Frozen tissue sections.
    • Typically done on tissue micro-arrays where a single slide has upto 60 sections arrayed for uniform staining.

Recent Advances

  • It is now possible to interrogate more biological modalities at single-cell level:
    • ATAC-Seq for single-cell single-base resolution methylation status.
    • scHiC for single-cell 1 kb resolution study of 3D chromatin organization.
  • spatial scRNA-Seq for analysis of single-cell RNAseq with spatial resolution on FFPE-tissue section (very recent), and fresh-frozen tissue sections e.g. biopsies (been around for about 3 years).
  • We have a Human Cell Atlas project that is underway to map all the cell-types seen in human body.

Pitfalls with scRNA Seq

  • Requires single-cell, clump free suspension, limiting use.
  • Requires sequencing to a huge depth. Generally about 10000 reads per cell (mappable).
  • The data analysis compute requirements can balloon quickly.
    • Unfortunately for very large data-sets R stops being most useful language for processing.
    • python works for larger data sets, but it adds some extra processing steps for visualization. Although, it is getting better.
  • The commercial platforms are not very cheap. They are finally within grasp of individual labs, but still remain very expensive to be used routinely.
    • There are a few cheaper alternatives e.g. DropSeq, and SeqWell which are also relatively easy to use, but post processing is fairly difficult.

Why R?

Personal Opinion (mostly!)

  • Open-source
  • Available for most platforms.
  • A lot of sophisticated graphing capability is available out of the box.
  • New packages are being added all the time.
  • With the tidyverse family of packages, and ggplot2, using R has now become much more democratized.

Part II – Data Description

Sample Qualities

  • This is a sample derived from human blood, after removing the red-blood corpuscles.
  • The remaining cells are mostly immune cells
    • Immune cells have well defined cell populations.
    • These populations can be identified by specific combinations of cell-surface protein markers.
    • Each type of cell has specific role in the immune system.

Sample Data

dat <- Read10X("./data/filtered_gene_bc_matrices/hg19")
dat[1:5, 1:5]
5 x 5 sparse Matrix of class "dgCMatrix"
             AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1
MIR1302-10                  .                .                .
FAM138A                     .                .                .
OR4F5                       .                .                .
RP11-34P13.7                .                .                .
RP11-34P13.8                .                .                .
             AAACCGTGCTTCCG-1 AAACCGTGTATGCG-1
MIR1302-10                  .                .
FAM138A                     .                .
OR4F5                       .                .
RP11-34P13.7                .                .
RP11-34P13.8                .                .

Note the sparse data, where the 0s are denoted with ..

scRNA-Seq data is sparse. Usually each cell can have upto 10K genes detected (depending on the chemistry)

Part III – Analysis

Analysis Method

Analysis is typically in following steps:

  • Read in the count data.
  • Plot some standard QC to make sure the data is okay.
    • If required subset the data.
  • Normalize the data. This is to make sure that the distribution is easier to model. Typically LogNormalization. This works the best for count data because the dynamic range can be very high.
  • Identify highly variable features (genes)
  • scale data.
  • Reduce dimensions, typically PCA.
    • Plot this data (optionally).
  • Cluster the cells.
  • Non-linear dimensionality reduction.
  • Plotting.
  • Find cluster specific genes.
  • Downstream analysis … (not part of our current session)

Basic QC

Basic QC plots

Basic QC (continued)

Identify Variable Features

Variable Genes

We generally use these variable features for the dimensionality reduction. We still have whole data so we can plot for any part of the information.

Scale Data

pbmc <- ScaleData(pbmc)

Uses only the VariableFeatures identified earlier.

Note

All the operations are being done on the same object, and saved in the same object.
All information can be found in the analysis objet structure.

Linear dimensionality reduction (PCA)

Determine true dimesionality

How to Select the Top Important Principal Components?

Elbow plot

Note

This is a very subjective matter. Typically, downstream analysis will inform how many PCs to use. From this point onwards a typical analysis can become very iterative.

Clustering

  • One of the most basic methods is followed in the Seurat approach.
  • This is currently one of the most actively worked on aspect of high-dimensional biological data in general, and single-cell genomics data analysis in particular.
  • There are other graph based methods that are more optimal when analyzing multiple single-cell datasets together.

Clustering

pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 2700
Number of edges: 97892

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8719
Number of communities: 9
Elapsed time: 0 seconds

Clustering

UMAP plot

Clustering

Comparison between linear and non-linear dimensionality reduction.

Clustering

Note

As you can see non-linear dimensionality reduction like UMAP (also tSNE) brings out more structure from the data.

PCA also brings out the salient structure of the data, e.g. you can see two clear clusters separated by PC1, and PC2, but other fine structure will only become apparent as you plot more combinations of the PCs.

Find Cluster Specific Markers

Look for the Biology!

pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>%
  group_by(cluster) %>%
  slice_max(n = 2, order_by = avg_log2FC)
# A tibble: 18 × 7
# Groups:   cluster [9]
       p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene     
       <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>    
 1 1.10e- 81       1.33 0.432 0.111 1.51e- 77 0       CCR7     
 2 2.53e- 48       1.08 0.336 0.108 3.46e- 44 0       PRKCQ-AS1
 3 0               5.50 0.994 0.215 0         1       S100A9   
 4 0               5.47 0.967 0.121 0         1       S100A8   
 5 4.78e- 57       1.22 0.42  0.113 6.56e- 53 2       AQP3     
 6 1.61e- 82       1.18 0.983 0.642 2.21e- 78 2       LTB      
 7 0               4.29 0.934 0.043 0         3       CD79A    
 8 8.49e-274       3.58 0.619 0.022 1.16e-269 3       TCL1A    
 9 4.29e-201       3.20 0.599 0.049 5.88e-197 4       GZMK     
10 1.52e-211       3.06 0.95  0.229 2.08e-207 4       CCL5     
11 5.51e-184       3.34 0.975 0.135 7.56e-180 5       FCGR3A   
12 3.17e-124       3.01 1     0.315 4.35e-120 5       LST1     
13 1.51e-170       4.82 0.939 0.135 2.07e-166 6       GNLY     
14 1.09e-259       4.81 0.966 0.072 1.49e-255 6       GZMB     
15 1.43e-267       3.87 0.833 0.009 1.96e-263 7       FCER1A   
16 3.90e- 33       2.91 0.944 0.207 5.34e- 29 7       HLA-DQA1 
17 4.24e-116       8.47 1     0.024 5.81e-112 8       PPBP     
18 6.19e-182       7.13 0.933 0.011 8.49e-178 8       PF4      

Find Cluster Specific Markers

Look for the Biology!

Marker expression distribution across the data

Find Cluster Specific Markers

Look for the Biology!

Marker expression distribution across the data

Downstream Analysis

Once you have a list of genes (no matter how you obtained it!) you can do all kinds of analyses with it.

Most of the downstream analyses will be dictated by experiment design.

This example had only one sample, there are experimental designs where you have multiple samples, e.g. treatment/control, WT/KO, time-series, etc.

Take Home

  • scRNA seq is a versatile and powerful technique to generate biological insights.
  • With packages like Seurat the analysis of scRNA-Seq data becomes quite easy.
  • With proper analysis the data will reveal underlying biology, and that automatically acts like a “positive-control”.
  • This is not even “scratching” the surface. There is a huge list of algorithms and programs that can be used for single cell omics data analysis at https://github.com/seandavi/awesome-single-cell.

Questions?

Thank you!