First, some review

Today our goal is to use Seurat and use the PBMC dataset. We will need 3 packages, called “Seurat”, “dplyr”, and “patchwork.” We’ll run through the same process as usual to get started.

# Remember our four steps! 
# Step 1) Load packages
library(Seurat) # This package has functions to analyze single-cell data.
library(dplyr) # This package allows us to use reshape our data - like "pliers"
library(patchwork) # This package is a "helper" package - we won't use it directly

# Step 2) Load in data (and ensure working directory is correct)
getwd()
setwd('/Users/iancoccimiglio/OneDrive/RstudioCode/CodingClub/Four/')
pbmc.data <- Read10X('../data/filtered_gene_bc_matrices/hg19/') # Reading in the data
pbmc <- CreateSeuratObject(pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200) # Creates a Seurat Object, subsets only the features (genes) that are contained in at least 3 cells, and the cells which contain at least 200 features.
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

Pause, what are we looking at?

Three levels of understanding of functions
1. What are each function’s inputs and outputs? (> We’ll mostly stay here <)
2. How does each function work in theory? (Important to understand)
3. How does each function work internally? (Impractical, except for developers)
* We can use View() for this, if needed.

What is pbmc.data?
  • How can we examine it? Two new useful functions.
    • head(object), and tail(object)
  • Other methods?
    • type pbmc.data in the console
    • In RStudio, we have a variable viewer in the top right.
# head(pbmc.data) - not showing it here because the output is large

For now, this looks messy. The reason is that this type of data matrix is called sparse. In scRNAseq, many cells do not show any RNA, and many genes may not appear. So the real matrix is full of “zeros”. But it’s wasteful and inefficient to store and calculate thousands of zeros.

So, R makes a dcgMatrix object which only features the positive values. I can explain in more detail if anyone is interested!

What is a Seurat Object?
  • How can we examine it? Same as before.
    • head() and tail() still work! They are very flexible functions
    • type pbmc in the console
    • use variable viewer
head(pbmc)
##                  orig.ident nCount_RNA nFeature_RNA
## AAACATACAACCAC-1     pbmc3k       2419          779
## AAACATTGAGCTAC-1     pbmc3k       4903         1352
## AAACATTGATCAGC-1     pbmc3k       3147         1129
## AAACCGTGCTTCCG-1     pbmc3k       2639          960
## AAACCGTGTATGCG-1     pbmc3k        980          521
## AAACGCACTGGTAC-1     pbmc3k       2163          781
## AAACGCTGACCAGT-1     pbmc3k       2175          782
## AAACGCTGGTTCTT-1     pbmc3k       2260          790
## AAACGCTGTAGCCA-1     pbmc3k       1275          532
## AAACGCTGTTTCTG-1     pbmc3k       1103          550

This is much more interpretable. Seurat makes a big object which contains all of the useful single-cell information. Here, 13,714 different features (genes) are detected across 2700 biological cells. Each row is a specific cell. nCount shows the number of molecules detected in each cell, whereas nFeatures shows the number of unique genes in each cell.

A Sanity Check

# Remember: Step 3+4) Use functions to plot and process the data.
any(pbmc$nCount_RNA <= pbmc$nFeature_RNA) # New function! any() checks all rows for any particular value that is TRUE. Why do you think this value is false?
## [1] FALSE

Violin Plots

We are going to create our first violin plot, using VlnPlot(). We input our Seurat object (pbmc), but we are visualizing specific features of the object as a violin plot. ncol() sets the number of columns for our plots.

VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA"), ncol = 2) # Creates a violin plot from pbmc. Wow!

Identifying mitochondrial genes, then adding them to our violin QC plot.

We use PercentageFeatureSet() to calculate and create a percentage of counts for each biological cell. Here, we are using it to identify mitochondrial genes. I can explain “pattern”, if there’s time.

Important to note - Our Seurat object is extensible. We use the “[[]]” notation to add more values to our Seurat Object.

pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-") # New function! 

VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3) # Now we have more information in our pbmc object.

Scatter plots

What if we wanted to compare aspects of our single-cell data against other aspects? Here, we are comparing amount of RNA to percent mitochondrial, and amount of RNA against unique features in each cell.

plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt") # FeatureScatter() allows us to create scatter plots of any of our single cell data.  My interpretation is that cells with lots of mitochondrial genes are likely low-quality/dying, and therefore do not express many other RNA.

plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") # We should expect a relatively linear relationship between them (more RNA detected in a cell -> more unique genes discovered in that cell)

plot1 + plot2 # Combine and show both plots side-by-side

Finally, we perform some quality control.

Here, we remove any cells that show excessive mitochondrial genes, too many/few unique RNA values (possible doublets?).

pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5) # Remember, the subset function takes an object, then KEEPS the values that meets the conditions you set. Here, we are removing biological cells that don't meet our criteria.

# End off by doing another sanity check showing that now, the cells with high counts mitochondrial genes are gone from our Seurat object.
FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt") 

Recap:

Functions from last week:

  • subset(), c(), $ selection.

New general R functions:

  • any() - allows us to check if any condition is TRUE from a list of conditions.
  • head(), tail() - allows us to see the top or bottom values of an object.
  • View() - allows us to see how a function works.

New Seurat functions:

  • Read10X() - Takes IN a folder of Features/Barcodes/Matrix created by 10X. Returns OUT a sparse matrix representing all of that information.
  • CreateSeuratObject() - Takes IN a sparse matrix, Returns OUT a human-interpretable Seurat Object.
  • VlnPlot() - Takes IN a Seurat Object and Features to plot. Returns OUT a violin plot, which is like a snazzy boxplot tbh. Shows densities, I guess.
  • PercentageFeatureSet() - Takes IN a Seurat object and pattern to look for in each biological cell. Returns OUT a percentage calculation for how much that pattern occurs across all cells.
  • FeatureScatter() - Takes IN a Seurat object, and 2 single-cell features to plot. Returns OUT a scatter plot of these features for each biological cell.