Today our goal is to use Seurat and use the PBMC dataset. We will need 3 packages, called “Seurat”, “dplyr”, and “patchwork.” We’ll run through the same process as usual to get started.
# Remember our four steps!
# Step 1) Load packages
library(Seurat) # This package has functions to analyze single-cell data.
library(dplyr) # This package allows us to use reshape our data - like "pliers"
library(patchwork) # This package is a "helper" package - we won't use it directly
# Step 2) Load in data (and ensure working directory is correct)
getwd()
setwd('/Users/iancoccimiglio/OneDrive/RstudioCode/CodingClub/Four/')
pbmc.data <- Read10X('../data/filtered_gene_bc_matrices/hg19/') # Reading in the data
pbmc <- CreateSeuratObject(pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200) # Creates a Seurat Object, subsets only the features (genes) that are contained in at least 3 cells, and the cells which contain at least 200 features.
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
Three levels of understanding of functions
1. What are each function’s inputs and outputs? (> We’ll
mostly stay here <)
2. How does each function work in theory? (Important to
understand)
3. How does each function work internally? (Impractical, except for
developers)
* We can use View() for this, if needed.
# head(pbmc.data) - not showing it here because the output is large
For now, this looks messy. The reason is that this type of data
matrix is called sparse. In scRNAseq, many cells do not show
any RNA, and many genes may not appear. So the real matrix is full of
“zeros”. But it’s wasteful and inefficient to store and calculate
thousands of zeros.
So, R makes a dcgMatrix object which only features the positive values. I can explain in more detail if anyone is interested!
head(pbmc)
## orig.ident nCount_RNA nFeature_RNA
## AAACATACAACCAC-1 pbmc3k 2419 779
## AAACATTGAGCTAC-1 pbmc3k 4903 1352
## AAACATTGATCAGC-1 pbmc3k 3147 1129
## AAACCGTGCTTCCG-1 pbmc3k 2639 960
## AAACCGTGTATGCG-1 pbmc3k 980 521
## AAACGCACTGGTAC-1 pbmc3k 2163 781
## AAACGCTGACCAGT-1 pbmc3k 2175 782
## AAACGCTGGTTCTT-1 pbmc3k 2260 790
## AAACGCTGTAGCCA-1 pbmc3k 1275 532
## AAACGCTGTTTCTG-1 pbmc3k 1103 550
This is much more interpretable. Seurat makes a big object which contains all of the useful single-cell information. Here, 13,714 different features (genes) are detected across 2700 biological cells. Each row is a specific cell. nCount shows the number of molecules detected in each cell, whereas nFeatures shows the number of unique genes in each cell.
# Remember: Step 3+4) Use functions to plot and process the data.
any(pbmc$nCount_RNA <= pbmc$nFeature_RNA) # New function! any() checks all rows for any particular value that is TRUE. Why do you think this value is false?
## [1] FALSE
We are going to create our first violin plot, using VlnPlot(). We input our Seurat object (pbmc), but we are visualizing specific features of the object as a violin plot. ncol() sets the number of columns for our plots.
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA"), ncol = 2) # Creates a violin plot from pbmc. Wow!
We use PercentageFeatureSet() to calculate and create a percentage of counts for each biological cell. Here, we are using it to identify mitochondrial genes. I can explain “pattern”, if there’s time.
Important to note - Our Seurat object is extensible. We use the “[[]]” notation to add more values to our Seurat Object.
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-") # New function!
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3) # Now we have more information in our pbmc object.
What if we wanted to compare aspects of our single-cell data against other aspects? Here, we are comparing amount of RNA to percent mitochondrial, and amount of RNA against unique features in each cell.
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt") # FeatureScatter() allows us to create scatter plots of any of our single cell data. My interpretation is that cells with lots of mitochondrial genes are likely low-quality/dying, and therefore do not express many other RNA.
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") # We should expect a relatively linear relationship between them (more RNA detected in a cell -> more unique genes discovered in that cell)
plot1 + plot2 # Combine and show both plots side-by-side
Here, we remove any cells that show excessive mitochondrial genes, too many/few unique RNA values (possible doublets?).
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5) # Remember, the subset function takes an object, then KEEPS the values that meets the conditions you set. Here, we are removing biological cells that don't meet our criteria.
# End off by doing another sanity check showing that now, the cells with high counts mitochondrial genes are gone from our Seurat object.
FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
Functions from last week:
New general R functions:
New Seurat functions: