harmony batch effect

Herein, we will utilise the Signac R package’s Harmony batch effect correction method (Korsunsky et al. 2019). The Harmony algorithm is accessible on GitHub, and Signac provides integration tutorials.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(Seurat)

## Attaching SeuratObject

library(harmony)

## Loading required package: Rcpp

library(Signac)
library(GenomicRanges)

## Loading required package: stats4

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min

## Loading required package: S4Vectors

## 
## Attaching package: 'S4Vectors'

## The following objects are masked from 'package:dplyr':
## 
##     first, rename

## The following object is masked from 'package:base':
## 
##     expand.grid

## Loading required package: IRanges

## 
## Attaching package: 'IRanges'

## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, slice

## Loading required package: GenomeInfoDb

pbmcv2 <- readRDS("/mnt/nectar_volume/home/eraz0001/new/pbmcv2.rds")
pbmcv1p1 <- readRDS("/mnt/nectar_volume/home/eraz0001/new/pbmcv1p1.rds")

head(pbmcv1p1)

##                       orig.ident nCount_peaks nFeature_peaks
## AAACGAAAGAACGCCA-1 SeuratProject        56436          20043
## AAACGAAAGACAGCTG-1 SeuratProject         7096           3436
## AAACGAAAGAGTAAGG-1 SeuratProject        10440           4848
## AAACGAAAGATGCGCA-1 SeuratProject        25283          10345
## AAACGAAAGGCAGATC-1 SeuratProject        10896           5057
## AAACGAAAGTCATACC-1 SeuratProject         6528           3192
## AAACGAACACGCGTTG-1 SeuratProject        17260           7419
## AAACGAACACGGCCAT-1 SeuratProject        11904           5426
## AAACGAACACTCAAGT-1 SeuratProject         9729           4644
## AAACGAACAGCGTCGT-1 SeuratProject        17572           7556

Add ‘DataSet’ info into the seurat obj.

pbmcv1p1$dataset <- 'pbmcv1p1'
pbmcv2$dataset <- 'pbmcv2'

Step 2: Merge objects Use the merge function to combine the objects without batch correction, and visualize them using the head and tail functions.

unintegrated <- merge(x = pbmcv1p1, y = pbmcv2, add.cell.ids = c("pbmcv1p1", "pbmcv2"))

## 
## Binding matrix rows
## 
## Binding matrix rows
## 
## Binding matrix rows
## 
## Binding matrix rows

Next, we visualize the data before batch correction.

unintegrated <- RunTFIDF(unintegrated)

## Performing TF-IDF normalization

unintegrated <- FindTopFeatures(unintegrated, min.cutoff = 20)
unintegrated <- RunSVD(unintegrated)

## Running SVD

## Scaling cell embeddings

unintegrated <- RunUMAP(unintegrated, dims = 2:50, reduction = 'lsi')

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

## 12:23:10 UMAP embedding parameters a = 0.9922 b = 1.112

## 12:23:10 Read 20882 rows and found 49 numeric columns

## 12:23:10 Using Annoy for neighbor search, n_neighbors = 30

## 12:23:10 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 12:23:13 Writing NN index file to temp file /tmp/RtmpUSJvea/file2da0fd9297f08
## 12:23:13 Searching Annoy index using 1 thread, search_k = 3000
## 12:23:23 Annoy recall = 100%
## 12:23:24 Commencing smooth kNN distance calibration using 1 thread
## 12:23:27 Initializing from normalized Laplacian + noise
## 12:23:28 Commencing optimization for 200 epochs, with 870980 positive edges
## 12:23:40 Optimization finished

DimPlot(unintegrated, group.by = 'dataset', pt.size = 0.1)

Given that these are both human PMBCs, one might expect better overlap between the two samples. We suspect that chemistry batch effects could cause the lack of overlap. Step 4: Batch correction using Harmony In this step, use Harmony to correct the batch effects between the two samples.

hm.integrated <- RunHarmony(object = unintegrated, group.by.vars = 'dataset', reduction = 'lsi', assay.use = 'peaks', project.dim = FALSE)

## Harmony 1/10

## Harmony 2/10

## Harmony 3/10

## Harmony 4/10

## Harmony 5/10

## Harmony 6/10

## Harmony 7/10

## Harmony 8/10

## Harmony 9/10

## Harmony 10/10

hm.integrated <- RunUMAP(hm.integrated, dims = 2:30, reduction = 'harmony')

## 12:24:57 UMAP embedding parameters a = 0.9922 b = 1.112

## 12:24:57 Read 20882 rows and found 29 numeric columns

## 12:24:57 Using Annoy for neighbor search, n_neighbors = 30

## 12:24:57 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 12:25:00 Writing NN index file to temp file /tmp/RtmpUSJvea/file2da0fd69dae4d2
## 12:25:00 Searching Annoy index using 1 thread, search_k = 3000
## 12:25:10 Annoy recall = 100%
## 12:25:11 Commencing smooth kNN distance calibration using 1 thread
## 12:25:13 Initializing from normalized Laplacian + noise
## 12:25:14 Commencing optimization for 200 epochs, with 886886 positive edges
## 12:25:26 Optimization finished

DimPlot(hm.integrated, group.by = 'dataset', pt.size = 0.1)

After batch correction in Harmony, note that the two PBMC samples largely overlap.

harmony batch effect

Ehsan

05/05/2022