Coding Club 7

Review of objects and data-types in R.

R objects are constructions of smaller building blocks. At the foundational level, we have singular, indivisible blocks. These are the ‘atoms’ of R.

1. Singular - Atomic level

The fundamental types of data, which we have seen before. Objects in R are represented as combinations of these. 1. Characters (also known as “strings”) 2. Numerics 3. Integers 4. Logical (also called Boolean)

2. Multiple - Sequential level

Here, we stick together multiple of the SAME atoms to create a linear sequence of them. These would be the ‘molecules’ of R. In R, we call these vectors.

## How can we create empty vectors to store our data?
# 1. Make a character variable? Answer is below.
# 2. Make a numeric variable
# 3. Make an integer variable
# 4. Make a logical variable

charac = character(3)
charac[1] = 'Danny Devito'
charac[2] = 'Quentin Tarantino'
charac[3] = 'Samuel Jackson'

## And how might we do this directly?
char2 = c('Danny Devito', 'Quentin Tarantino', 'Samuel Jackson')

# What if we wanted to add a fourth name, 'Dolly Parton', to our list?

# How do we make a linear sequence of numbers?

# How can we analyze types of data? Two useful functions are "class()" and "typeof()". The key to these is to use them when you're getting unexpected results.

Important to note, a proper vector has only one elemental data-type. If this is true, it allows R to perform operations without causing errors.

If a vector has multiple data-types, then it is called a list. These lists aren’t directly operable - they are containers for other objects.

# The first operation works, it creates a list. The multiplication operation fails, because it can't understanding things that aren't numbers.

brokenList = list(45, 'Ian')
# brokenList*2

# Why does this work?
workingVector = c(45, 90)
result = workingVector*2

# We can create truly insane lists.
strangeList = list(charac, workingVector, brokenList)

3. Higher dimensional objects.

Higher dimensional vectors are matrices (also known as arrays). They have one fundamental data-type, and thus you can perform like matrix-multiplication on them.
Higher dimensional rectangular lists are called data-frames. This is what a spreadsheet tends to be stored as.
Higher dimensional NON-rectangular lists is a generic object (like a Seurat object). It is ultra-flexible and can contain practically anything, but therefore needs special functions to manipulate them.

# Making our own dataframe / spreadsheet
simulateDat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20) # A data-frame

Seurat, the pipeline thus far.

library(dplyr)
library(Seurat)
library(patchwork)
getwd()

# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "../data/filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

pbmc # View the object.

pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-") # Get mitochondrial genes.

# Create violin plots for quality control.
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

# Create scatterplots for the distribution of cells and their data-types
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2

# Remove cells that have too few features (empty droplets), too many features (doublets), or express too many mitochondrial genes. These cells are the only cells 'worth' doing additional analysis on.
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

Seurat: After normalization

Our main goals here are to accomplish:

1. What even is normalizing?
1. Looking at variable features
1. Scaling the data
1. Principal component analysis (PCA)
1. Clustering

Optional section: Deconstructing normalization.

What are the gears of how normalization works in Seurat? First, we must understand why we do it.

Largely, in scRNA-seq, we normalize so that the data from cell 1 becomes relatively comparable with cell 2. That means Seurat will normalize one cell at a time. The baseline log-normalization operates by transforming each detected gene, dividing it by the total RNA in the cell, multiplying by 10,000, and then taking the logarithim.

We do this because a lot of how genes vary (in terms of detection) is because of capture-efficiency and reverse-transcription efficiency.

How is this done? Let’s perform subseting on just one single cell to see how.

# Recreating the normalization procedure. DON'T do this normally. This is just for understanding how these functions really work.
pbmc_tiny <- pbmc.data[,1] # Subset pbmc.data to get just one cell using [,1].
oneCell <- pbmc_tiny[pbmc_tiny > 0] # Subset just the genes that are above zero
length(oneCell) # how many genes are there?

## [1] 781

totalRNA <- sum(oneCell) # Total amount of RNA in the cell
normalizedRNA <- oneCell/totalRNA # normalized each gene to the total extracted RNA.
multipliedRNA <- normalizedRNA*10000 # multiplied by 10000, just for human-readability.
lognormRNA <- log1p(multipliedRNA) # Take the logarithim.
head(lognormRNA) # Normalized values!

##   MRPL20    RPL22 TNFRSF25    EFHD2   NECAP2   AKR7A2 
## 1.635208 1.635208 2.225817 1.635208 1.635208 1.635208

# How does Seurat do it normally?
head(pbmc)

##                  orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACATACAACCAC-1     pbmc3k       2419          779  3.0177759
## AAACATTGAGCTAC-1     pbmc3k       4903         1352  3.7935958
## AAACATTGATCAGC-1     pbmc3k       3147         1129  0.8897363
## AAACCGTGCTTCCG-1     pbmc3k       2639          960  1.7430845
## AAACCGTGTATGCG-1     pbmc3k        980          521  1.2244898
## AAACGCACTGGTAC-1     pbmc3k       2163          781  1.6643551
## AAACGCTGACCAGT-1     pbmc3k       2175          782  3.8160920
## AAACGCTGGTTCTT-1     pbmc3k       2260          790  3.0973451
## AAACGCTGTAGCCA-1     pbmc3k       1275          532  1.1764706
## AAACGCTGTTTCTG-1     pbmc3k       1103          550  2.9011786

pbmc <- NormalizeData(pbmc) # this puts normalized data in pbmc[['RNA']]@data
seuratNormPB <- pbmc[['RNA']]@data # get the normalized data
seuratOne <- seuratNormPB[,1] # subset just one cell
seuratOneCellGenes <- seuratOne[seuratOne > 0] # look only at genes greater than zero.
head(seuratOneCellGenes) # compare this to our results

##   MRPL20    RPL22 TNFRSF25    EFHD2   NECAP2   AKR7A2 
## 1.635873 1.635873 2.226555 1.635873 1.635873 1.635873

Next step: Finding variable features/genes.

For our purposes here, we don’t care about genes that aren’t different between cells. In terms of clustering, we care mostly in looking at genes that vary the most across our cells.

Many genes (house-keeping genes, for example) are equivalently expressed between many cells, so these are removed here.

pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000) # we have control over the amount of features we want to consider "differentially expressed".

# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)

# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)

## When using repel, set xnudge and ynudge to 0 for optimal results

plot1 + plot2

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Removed 1 rows containing missing values (geom_point).

Scaling the data

I’d love to get more into scaling data. For the most part, its purpose is to make future computations (especially PCA) simpler for the computer. It is notoriously short in Seurat.

In brief, it has a similar role to standardizing a normal distribution. It moves the mean expression for each gene to zero, and rescales the variance for each gene to 1.

all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)

## Centering and scaling data matrix

PCA!

PCA, despite being complex, is implemented very easily in Seurat. By default, it runs on the previously calculated variable features.

pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc)) # Performs PCA. The second argument isn't even needed.

## PC_ 1 
## Positive:  CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP 
##     FCER1G, CFD, LGALS1, S100A8, CTSS, LGALS2, SERPINA1, IFITM3, SPI1, CFP 
##     PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD 
## Negative:  MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, STK17A, CTSW 
##     CD247, GIMAP5, AQP3, CCL5, SELL, TRAF3IP3, GZMA, MAL, CST7, ITM2A 
##     MYC, GIMAP7, HOPX, BEX2, LDLRAP1, GZMK, ETS1, ZAP70, TNFAIP8, RIC3 
## PC_ 2 
## Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74 
##     HLA-DMA, HLA-DPB1, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB 
##     BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74 
## Negative:  NKG7, PRF1, CST7, GZMB, GZMA, FGFBP2, CTSW, GNLY, B2M, SPON2 
##     CCL4, GZMH, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX 
##     TTC38, APMAP, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, RHOC 
## PC_ 3 
## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, HLA-DPA1, CD74, MS4A1, HLA-DRB1, HLA-DRA 
##     HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, CD37, HVCN1, FCRLA, IRF8 
##     PLAC8, BLNK, MALAT1, SMIM14, PLD4, IGLL5, P2RX5, LAT2, SWAP70, FCGR2B 
## Negative:  PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU 
##     HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1 
##     NGFRAP1, F13A1, SEPT5, RUFY1, TSC22D1, MPP1, CMTM5, RP11-367G6.3, MYL9, GP1BA 
## PC_ 4 
## Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1, CD74, HIST1H2AC, HLA-DPB1, PF4, SDPR 
##     TCL1A, HLA-DPA1, HLA-DRB1, HLA-DQA2, PPBP, HLA-DRA, LINC00926, GNG11, SPARC, HLA-DRB5 
##     GP9, AP001189.4, CA2, PTCRA, CD9, NRGN, RGS18, CLU, TUBB1, GZMB 
## Negative:  VIM, IL7R, S100A6, IL32, S100A8, S100A4, GIMAP7, S100A10, S100A9, MAL 
##     AQP3, CD2, CD14, FYB, LGALS2, GIMAP4, ANXA1, CD27, FCN1, RBP7 
##     LYZ, S100A11, GIMAP5, MS4A6A, S100A12, FOLR3, TRABD2A, AIF1, IL8, IFI6 
## PC_ 5 
## Positive:  GZMB, S100A8, NKG7, FGFBP2, GNLY, CCL4, CST7, PRF1, GZMA, SPON2 
##     GZMH, S100A9, LGALS2, CCL3, CTSW, XCL2, CD14, CLIC3, S100A12, RBP7 
##     CCL5, MS4A6A, GSTP1, FOLR3, IGFBP7, TYROBP, AKR1C3, TTC38, XCL1, HOPX 
## Negative:  LTB, IL7R, CKB, VIM, MS4A7, AQP3, CYTIP, RP11-290F20.3, SIGLEC10, HMOX1 
##     LILRB2, PTGES3, HN1, CD27, MAL, GDI2, CD2, CORO1B, ANXA5, TUBA1B 
##     FAM110A, ATP1A1, TRADD, PPA1, CCDC109B, ABRACL, CTD-2006K23.1, WARS, VMO1, IFITM2

print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)

## PC_ 1 
## Positive:  CST3, TYROBP, LST1, AIF1, FTL 
## Negative:  MALAT1, LTB, IL32, IL7R, CD2 
## PC_ 2 
## Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1 
## Negative:  NKG7, PRF1, CST7, GZMB, GZMA 
## PC_ 3 
## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1 
## Negative:  PPBP, PF4, SDPR, SPARC, GNG11 
## PC_ 4 
## Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1 
## Negative:  VIM, IL7R, S100A6, IL32, S100A8 
## PC_ 5 
## Positive:  GZMB, S100A8, NKG7, FGFBP2, GNLY 
## Negative:  LTB, IL7R, CKB, VIM, MS4A7

VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")

### Visualize the cells along their most variable dimensions

DimPlot(pbmc, reduction = "pca")

Visualize heatmap of the most heterogeneous features between cells.

DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE) # We can adjust the dimensions, and the cells, and whether we want an equal of +/- features.

Exploring the dimensionality of the data.

Honestly, I want to gloss over this section. But briefly, we want to get rid of dimensions that don’t contribute useful information. These dimensions are likely characterized by mostly technical noise, rather than biological meaning.

# we can keep these commented out if we want. They're quite slow functions.
# pbmc <- JackStraw(pbmc, num.replicate = 100)
# pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
# JackStrawPlot(pbmc, dims = 1:15)

Non-linear clustering.

What are we trying to do? We are looking for how “close” cells are in a higher-dimensional space. This involves copious amounts of graph theory. There are many methods for doing this. The linear-sense is the most obvious, but UMAP/TSNE work in some “non-linear” sense. As in, the normal linear idea of ‘distance’ doesn’t apply.

pbmc <- FindNeighbors(pbmc, dims = 1:10) #

## Computing nearest neighbor graph

## Computing SNN

pbmc <- FindClusters(pbmc, resolution = 0.5) # Previously good values are between 0.4-1.2

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2638
## Number of edges: 95840
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8726
## Number of communities: 9
## Elapsed time: 0 seconds

head(Idents(pbmc), 5)

## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 
##                2                3                2                1 
## AAACCGTGTATGCG-1 
##                6 
## Levels: 0 1 2 3 4 5 6 7 8

Finally, UMAP

Again, dramatically simple. We use the previously calculated principal components to create a UMAP object, then DimPlot to plot it.

pbmc <- RunUMAP(pbmc, dims = 1:12) # choose the number of dimensions you want

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

DimPlot(pbmc, reduction = "umap", label = TRUE) # Do we want labels?