R objects are constructions of smaller building blocks. At the foundational level, we have singular, indivisible blocks. These are the ‘atoms’ of R.
The fundamental types of data, which we have seen before. Objects in R are represented as combinations of these. 1. Characters (also known as “strings”) 2. Numerics 3. Integers 4. Logical (also called Boolean)
Here, we stick together multiple of the SAME atoms to create a linear sequence of them. These would be the ‘molecules’ of R. In R, we call these vectors.
## How can we create empty vectors to store our data?
# 1. Make a character variable? Answer is below.
# 2. Make a numeric variable
# 3. Make an integer variable
# 4. Make a logical variable
charac = character(3)
charac[1] = 'Danny Devito'
charac[2] = 'Quentin Tarantino'
charac[3] = 'Samuel Jackson'
## And how might we do this directly?
char2 = c('Danny Devito', 'Quentin Tarantino', 'Samuel Jackson')
# What if we wanted to add a fourth name, 'Dolly Parton', to our list?
# How do we make a linear sequence of numbers?
# How can we analyze types of data? Two useful functions are "class()" and "typeof()". The key to these is to use them when you're getting unexpected results.
Important to note, a proper vector has only one elemental data-type. If this is true, it allows R to perform operations without causing errors.
If a vector has multiple data-types, then it is called a list. These lists aren’t directly operable - they are containers for other objects.
# The first operation works, it creates a list. The multiplication operation fails, because it can't understanding things that aren't numbers.
brokenList = list(45, 'Ian')
# brokenList*2
# Why does this work?
workingVector = c(45, 90)
result = workingVector*2
# We can create truly insane lists.
strangeList = list(charac, workingVector, brokenList)
# Making our own dataframe / spreadsheet
simulateDat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20) # A data-frame
library(dplyr)
library(Seurat)
library(patchwork)
getwd()
# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "../data/filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
pbmc # View the object.
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-") # Get mitochondrial genes.
# Create violin plots for quality control.
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
# Create scatterplots for the distribution of cells and their data-types
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
# Remove cells that have too few features (empty droplets), too many features (doublets), or express too many mitochondrial genes. These cells are the only cells 'worth' doing additional analysis on.
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
Our main goals here are to accomplish:
What are the gears of how normalization works in Seurat? First, we must understand why we do it.
Largely, in scRNA-seq, we normalize so that the data from cell 1 becomes relatively comparable with cell 2. That means Seurat will normalize one cell at a time. The baseline log-normalization operates by transforming each detected gene, dividing it by the total RNA in the cell, multiplying by 10,000, and then taking the logarithim.
We do this because a lot of how genes vary (in terms of detection) is because of capture-efficiency and reverse-transcription efficiency.
How is this done? Let’s perform subseting on just one single cell to see how.
# Recreating the normalization procedure. DON'T do this normally. This is just for understanding how these functions really work.
pbmc_tiny <- pbmc.data[,1] # Subset pbmc.data to get just one cell using [,1].
oneCell <- pbmc_tiny[pbmc_tiny > 0] # Subset just the genes that are above zero
length(oneCell) # how many genes are there?
## [1] 781
totalRNA <- sum(oneCell) # Total amount of RNA in the cell
normalizedRNA <- oneCell/totalRNA # normalized each gene to the total extracted RNA.
multipliedRNA <- normalizedRNA*10000 # multiplied by 10000, just for human-readability.
lognormRNA <- log1p(multipliedRNA) # Take the logarithim.
head(lognormRNA) # Normalized values!
## MRPL20 RPL22 TNFRSF25 EFHD2 NECAP2 AKR7A2
## 1.635208 1.635208 2.225817 1.635208 1.635208 1.635208
# How does Seurat do it normally?
head(pbmc)
## orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACATACAACCAC-1 pbmc3k 2419 779 3.0177759
## AAACATTGAGCTAC-1 pbmc3k 4903 1352 3.7935958
## AAACATTGATCAGC-1 pbmc3k 3147 1129 0.8897363
## AAACCGTGCTTCCG-1 pbmc3k 2639 960 1.7430845
## AAACCGTGTATGCG-1 pbmc3k 980 521 1.2244898
## AAACGCACTGGTAC-1 pbmc3k 2163 781 1.6643551
## AAACGCTGACCAGT-1 pbmc3k 2175 782 3.8160920
## AAACGCTGGTTCTT-1 pbmc3k 2260 790 3.0973451
## AAACGCTGTAGCCA-1 pbmc3k 1275 532 1.1764706
## AAACGCTGTTTCTG-1 pbmc3k 1103 550 2.9011786
pbmc <- NormalizeData(pbmc) # this puts normalized data in pbmc[['RNA']]@data
seuratNormPB <- pbmc[['RNA']]@data # get the normalized data
seuratOne <- seuratNormPB[,1] # subset just one cell
seuratOneCellGenes <- seuratOne[seuratOne > 0] # look only at genes greater than zero.
head(seuratOneCellGenes) # compare this to our results
## MRPL20 RPL22 TNFRSF25 EFHD2 NECAP2 AKR7A2
## 1.635873 1.635873 2.226555 1.635873 1.635873 1.635873
For our purposes here, we don’t care about genes that aren’t different between cells. In terms of clustering, we care mostly in looking at genes that vary the most across our cells.
Many genes (house-keeping genes, for example) are equivalently expressed between many cells, so these are removed here.
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000) # we have control over the amount of features we want to consider "differentially expressed".
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
plot1 + plot2
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 1 rows containing missing values (geom_point).
I’d love to get more into scaling data. For the most part, its purpose is to make future computations (especially PCA) simpler for the computer. It is notoriously short in Seurat.
In brief, it has a similar role to standardizing a normal distribution. It moves the mean expression for each gene to zero, and rescales the variance for each gene to 1.
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
## Centering and scaling data matrix
PCA, despite being complex, is implemented very easily in Seurat. By default, it runs on the previously calculated variable features.
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc)) # Performs PCA. The second argument isn't even needed.
## PC_ 1
## Positive: CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP
## FCER1G, CFD, LGALS1, S100A8, CTSS, LGALS2, SERPINA1, IFITM3, SPI1, CFP
## PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD
## Negative: MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, STK17A, CTSW
## CD247, GIMAP5, AQP3, CCL5, SELL, TRAF3IP3, GZMA, MAL, CST7, ITM2A
## MYC, GIMAP7, HOPX, BEX2, LDLRAP1, GZMK, ETS1, ZAP70, TNFAIP8, RIC3
## PC_ 2
## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74
## HLA-DMA, HLA-DPB1, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB
## BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74
## Negative: NKG7, PRF1, CST7, GZMB, GZMA, FGFBP2, CTSW, GNLY, B2M, SPON2
## CCL4, GZMH, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX
## TTC38, APMAP, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, RHOC
## PC_ 3
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, HLA-DPA1, CD74, MS4A1, HLA-DRB1, HLA-DRA
## HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, CD37, HVCN1, FCRLA, IRF8
## PLAC8, BLNK, MALAT1, SMIM14, PLD4, IGLL5, P2RX5, LAT2, SWAP70, FCGR2B
## Negative: PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU
## HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1
## NGFRAP1, F13A1, SEPT5, RUFY1, TSC22D1, MPP1, CMTM5, RP11-367G6.3, MYL9, GP1BA
## PC_ 4
## Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1, CD74, HIST1H2AC, HLA-DPB1, PF4, SDPR
## TCL1A, HLA-DPA1, HLA-DRB1, HLA-DQA2, PPBP, HLA-DRA, LINC00926, GNG11, SPARC, HLA-DRB5
## GP9, AP001189.4, CA2, PTCRA, CD9, NRGN, RGS18, CLU, TUBB1, GZMB
## Negative: VIM, IL7R, S100A6, IL32, S100A8, S100A4, GIMAP7, S100A10, S100A9, MAL
## AQP3, CD2, CD14, FYB, LGALS2, GIMAP4, ANXA1, CD27, FCN1, RBP7
## LYZ, S100A11, GIMAP5, MS4A6A, S100A12, FOLR3, TRABD2A, AIF1, IL8, IFI6
## PC_ 5
## Positive: GZMB, S100A8, NKG7, FGFBP2, GNLY, CCL4, CST7, PRF1, GZMA, SPON2
## GZMH, S100A9, LGALS2, CCL3, CTSW, XCL2, CD14, CLIC3, S100A12, RBP7
## CCL5, MS4A6A, GSTP1, FOLR3, IGFBP7, TYROBP, AKR1C3, TTC38, XCL1, HOPX
## Negative: LTB, IL7R, CKB, VIM, MS4A7, AQP3, CYTIP, RP11-290F20.3, SIGLEC10, HMOX1
## LILRB2, PTGES3, HN1, CD27, MAL, GDI2, CD2, CORO1B, ANXA5, TUBA1B
## FAM110A, ATP1A1, TRADD, PPA1, CCDC109B, ABRACL, CTD-2006K23.1, WARS, VMO1, IFITM2
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
## PC_ 1
## Positive: CST3, TYROBP, LST1, AIF1, FTL
## Negative: MALAT1, LTB, IL32, IL7R, CD2
## PC_ 2
## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1
## Negative: NKG7, PRF1, CST7, GZMB, GZMA
## PC_ 3
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1
## Negative: PPBP, PF4, SDPR, SPARC, GNG11
## PC_ 4
## Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1
## Negative: VIM, IL7R, S100A6, IL32, S100A8
## PC_ 5
## Positive: GZMB, S100A8, NKG7, FGFBP2, GNLY
## Negative: LTB, IL7R, CKB, VIM, MS4A7
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
### Visualize the cells along their most variable dimensions
DimPlot(pbmc, reduction = "pca")
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE) # We can adjust the dimensions, and the cells, and whether we want an equal of +/- features.
Honestly, I want to gloss over this section. But briefly, we want to get rid of dimensions that don’t contribute useful information. These dimensions are likely characterized by mostly technical noise, rather than biological meaning.
# we can keep these commented out if we want. They're quite slow functions.
# pbmc <- JackStraw(pbmc, num.replicate = 100)
# pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
# JackStrawPlot(pbmc, dims = 1:15)
What are we trying to do? We are looking for how “close” cells are in a higher-dimensional space. This involves copious amounts of graph theory. There are many methods for doing this. The linear-sense is the most obvious, but UMAP/TSNE work in some “non-linear” sense. As in, the normal linear idea of ‘distance’ doesn’t apply.
pbmc <- FindNeighbors(pbmc, dims = 1:10) #
## Computing nearest neighbor graph
## Computing SNN
pbmc <- FindClusters(pbmc, resolution = 0.5) # Previously good values are between 0.4-1.2
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2638
## Number of edges: 95840
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8726
## Number of communities: 9
## Elapsed time: 0 seconds
head(Idents(pbmc), 5)
## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
## 2 3 2 1
## AAACCGTGTATGCG-1
## 6
## Levels: 0 1 2 3 4 5 6 7 8
Again, dramatically simple. We use the previously calculated principal components to create a UMAP object, then DimPlot to plot it.
pbmc <- RunUMAP(pbmc, dims = 1:12) # choose the number of dimensions you want
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
DimPlot(pbmc, reduction = "umap", label = TRUE) # Do we want labels?