Background

Krishna went over a series of steps involved in processing scRNA-seq data in general. The aims of these steps included loading, filtering, normalizing for differences between cells, visualizing and clustering the data. He used the data from this study that aimed to understand the effects of stromal cells in developing tumors.

To further illustrate and develop the ideas, methods and steps, I will use a subset of cells from this melanoma data - cells marked by CD45- GFP+ CD31- or inferred as cancer-associated fibroblasts at the 5 day and 11 day time-points. The main reason for choosing this subset is so that we have data of a managable size to perform the planned analyses in this section in the alloted time. You should be able to extend the methods/code to data with larger number of cells and/or variables.

Biological question:

  1. Identify the main cell-types (or clusters) in the data using marker genes
  2. Identify a set of genes whose mean (across sampled animals, note: I don’t say sampled cells) expression changes from the 5 day to the 11 day time-point in tumor cancer-associated fibroblast cells given the experiment design.
  3. Identify the clusters (cell-types) for which the proportion of cells from each animal belonging to it is associated with the time-point
  4. Assuming, the data was generated in two different batches, corrected for these effects

Experimental Design: At each time-point (5 day or 11 day) cancer associated fibroblast cells are randomly sampled from two mice that are in turn randomly sampled from a pool of C57BL/6 mice. The expression of all genes within each of the cells are assayed using the SMART-Seq2 protocol.

We are interested in the effect of time on gene expression in cancer-associated fibroblasts. However, the expression of gene in a cell is variable not just because of biological reasons like cell-to-cell (intra-animal) and animal-to-animal (inter-animal) variability but also due to technical reasons like the differences in sequencing depth from cell-to-cell, library preparation, animal handling etc. If we don’t fully account for these sources of variation then our results/interpretation may be incorrect. For example, the clustering of cells may be driven by some techninal factors.

Ideally, the claim we would like to make would be as generalizable as possible, i.e., if somebody else were to repeat the experiment above, go back and randomly sample animals, randomly sample cells from each of these animals at two time-points and sequence the RNA in these cells they would make similar claims. So we would like to demonstrate to a sceptical reviewer that despite all the variability in expression we can claim that the fact that we observe mean expression of a gene at day 11 is x times higher than its expression at day 5 is unlikely to driven by random chance. Therefore, in arriving at our conclusions we would formally need to account for the different sources of variation. We will do go over these four steps in the following code:

  1. Normalization
  2. Identification of marker genes
  3. Multi-sample multi-condition comparison
  4. Batch correction
##remove all data: start from scratch
rm(list = ls())
#Load the libraries
suppressMessages(library(Seurat))
suppressMessages(library(muscat))
suppressMessages(library(SummarizedExperiment))
suppressMessages(library(tidyverse))
suppressMessages(library(magrittr))

#source a function I slightly modified from the muscat package for the within-cluster multi-sample multi-condition comparison
source("pbDS_update.R")

raw_data <- read.csv("rawCounts.csv", header = T)
pheno_data <- read.csv("sub_pheno_data.csv", header = T)
print(dim(raw_data))
## [1] 38293   314
print(dim(pheno_data))
## [1] 313   7
##randomly introduce batch information for illustrating batch correction procedure
##Individuals 1197 and 1235 are assigned to batch 1
##Individuals 1200 and 1242 are assigned to batch 2
batch <- rep("batch1", nrow(pheno_data))
batch[pheno_data$Individual==1200 | pheno_data$Individual==1242] <- "batch2"
pheno_data <- data.frame(pheno_data, batch)

head(pheno_data)
##             X       FacsMarker Individual             InferredCellType Organ
## 1 22028_3_121 CD45- GFP+ CD31-       1197 cancer associated fibroblast  skin
## 2 22028_3_123 CD45- GFP+ CD31-       1197 cancer associated fibroblast  skin
## 3 22028_3_125 CD45- GFP+ CD31-       1197 cancer associated fibroblast  skin
## 4 22028_3_129 CD45- GFP+ CD31-       1197 cancer associated fibroblast  skin
## 5 22028_3_131 CD45- GFP+ CD31-       1197 cancer associated fibroblast  skin
## 6 22028_3_133 CD45- GFP+ CD31-       1197 cancer associated fibroblast  skin
##   SamplingSite  Time  batch
## 1        tumor 5 day batch1
## 2        tumor 5 day batch1
## 3        tumor 5 day batch1
## 4        tumor 5 day batch1
## 5        tumor 5 day batch1
## 6        tumor 5 day batch1

We will map the ensembl ids to gene symbols and load the data as a Seurat object. Seurat provides convenient functions to filter the cells and visualize the data. We will then use the data from the filtered cells for the sctransform normalization and further analyses.

mm10_genes <- read.csv("mm10_genes.tsv", header=FALSE, sep='\t', stringsAsFactors=FALSE,
                       col.names=c("ensembl_id", "gene_symbol"))

gene_ids <- as.character(raw_data$Geneid)

raw_data <- raw_data[,-1]
row.names(raw_data) <- gene_ids

# Map ENSEMBL Ids to their gene symbols
TempIndices <- match(gene_ids, mm10_genes$ensembl_id)
raw_data <- raw_data[!is.na(TempIndices), ]
CheckIds <- row.names(raw_data)[1:5]
NonUniqueGeneSymbols <-  mm10_genes$gene_symbol[TempIndices[!is.na(TempIndices)]]
UniqueGeneSymbols <- paste(NonUniqueGeneSymbols, 1:length(NonUniqueGeneSymbols), sep="_")
row.names(raw_data) <- UniqueGeneSymbols
colnames(raw_data) <- pheno_data$X

row.names(pheno_data) <- as.character(pheno_data$X)
pheno_data <- pheno_data[,-1]

# Finally, wrap this matrix up in a Seurat Object
data <- CreateSeuratObject(counts=raw_data,
                           project="basic_analysis",
                           min.cells=3,
                           min.features=200,
                           names.delim=NULL,
                           meta.data = pheno_data)
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
# First, find all mitochondrial genes, and count them as a percentage of total reads/cell
# In mouse, mitochondrial genes start with "mt-" so find all genes that match that pattern
# If you were doing this in a human dataset the pattern would be "^MT-"
data[["percent_mt"]] <- PercentageFeatureSet(object=data, pattern="^mt-")


# Typically, you would use much lower thresholds for mitochondrial genes (< 5%)
# This data set has lots of highly expressed mitochondrial genes though, so we'll leave them
quantnCountRNA <- quantile(data@meta.data$nCount_RNA, 0.05)
data <- subset(x=data, subset=nFeature_RNA > 200 & nCount_RNA > quantnCountRNA & percent_mt < 20)

print(sprintf("After filtering outliers: %d cells and %d genes", ncol(data), nrow(data)))
## [1] "After filtering outliers: 297 cells and 16148 genes"

##Normalization Now, we will perform sctranform based normalization and visualize the results

data <- SCTransform(data, method="qpoisson", vars.to.regress = NULL)
## Calculating cell attributes from input UMI matrix: log_umi
## Variance stabilizing transformation of count matrix of size 15163 by 297
## Model formula is y ~ log_umi
## Get Negative Binomial regression parameters per gene
## Using 2000 genes, 297 cells
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |==================                                                    |  25%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |====================================================                  |  75%
  |                                                                            
  |======================================================================| 100%
## There are 2 estimated thetas smaller than 1e-07 - will be set to 1e-07
## Found 3 outliers - those will be ignored in fitting/regularization step
## Second step: Get residuals using fitted parameters for 15163 genes
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |==                                                                    |   3%
  |                                                                            
  |=====                                                                 |   6%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |=========                                                             |  13%
  |                                                                            
  |===========                                                           |  16%
  |                                                                            
  |==============                                                        |  19%
  |                                                                            
  |================                                                      |  23%
  |                                                                            
  |==================                                                    |  26%
  |                                                                            
  |====================                                                  |  29%
  |                                                                            
  |=======================                                               |  32%
  |                                                                            
  |=========================                                             |  35%
  |                                                                            
  |===========================                                           |  39%
  |                                                                            
  |=============================                                         |  42%
  |                                                                            
  |================================                                      |  45%
  |                                                                            
  |==================================                                    |  48%
  |                                                                            
  |====================================                                  |  52%
  |                                                                            
  |======================================                                |  55%
  |                                                                            
  |=========================================                             |  58%
  |                                                                            
  |===========================================                           |  61%
  |                                                                            
  |=============================================                         |  65%
  |                                                                            
  |===============================================                       |  68%
  |                                                                            
  |==================================================                    |  71%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |======================================================                |  77%
  |                                                                            
  |========================================================              |  81%
  |                                                                            
  |===========================================================           |  84%
  |                                                                            
  |=============================================================         |  87%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |=================================================================     |  94%
  |                                                                            
  |====================================================================  |  97%
  |                                                                            
  |======================================================================| 100%
## Computing corrected count matrix for 15163 genes
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |==                                                                    |   3%
  |                                                                            
  |=====                                                                 |   6%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |=========                                                             |  13%
  |                                                                            
  |===========                                                           |  16%
  |                                                                            
  |==============                                                        |  19%
  |                                                                            
  |================                                                      |  23%
  |                                                                            
  |==================                                                    |  26%
  |                                                                            
  |====================                                                  |  29%
  |                                                                            
  |=======================                                               |  32%
  |                                                                            
  |=========================                                             |  35%
  |                                                                            
  |===========================                                           |  39%
  |                                                                            
  |=============================                                         |  42%
  |                                                                            
  |================================                                      |  45%
  |                                                                            
  |==================================                                    |  48%
  |                                                                            
  |====================================                                  |  52%
  |                                                                            
  |======================================                                |  55%
  |                                                                            
  |=========================================                             |  58%
  |                                                                            
  |===========================================                           |  61%
  |                                                                            
  |=============================================                         |  65%
  |                                                                            
  |===============================================                       |  68%
  |                                                                            
  |==================================================                    |  71%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |======================================================                |  77%
  |                                                                            
  |========================================================              |  81%
  |                                                                            
  |===========================================================           |  84%
  |                                                                            
  |=============================================================         |  87%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |=================================================================     |  94%
  |                                                                            
  |====================================================================  |  97%
  |                                                                            
  |======================================================================| 100%
## Calculating gene attributes
## Wall clock passed: Time difference of 6.057651 secs
## Determine variable features
## Set 3000 variable features
## Place corrected count matrix in counts slot
## Centering data matrix
## Set default assay to SCT
data <- RunPCA(data, verbose = FALSE)
data <- RunTSNE(data, dims = 1:30, verbose = FALSE)

data <- FindNeighbors(data, dims = 1:30, verbose = FALSE)
data <- FindClusters(data, verbose = FALSE)
DimPlot(data, label = TRUE, reduction = "tsne") 

DimPlot(data, reduction = "tsne", group.by = "Individual")

DimPlot(data, reduction = "tsne", group.by = "Time")

DimPlot(data, reduction = "tsne", group.by = "batch")

## Note there still appears to be an association of PC1 with nFeature_RNA
FeatureScatter(object=data, feature1="nFeature_RNA", feature2="PC_1")

FeaturePlot(object = data, features = "nFeature_RNA")

FeaturePlot(object = data, features = "PC_1")

##Find marker genes for each cluster We will find markers using the Wilcoxon two-sample test.

## Find markers for each of the 5 clusters
# MarkersRes <- FindAllMarkers(data, assay = "SCT", slot = "data", test.use = "wilcox", return.thresh = 1e-6)

## Find markers for cluster 0
MarkersRes1 <- FindMarkers(data, ident.1 = 0, assay = "SCT",slot = "data", test.use = "wilcox", return.thresh = 1e-6)
head(MarkersRes1)
##                     p_val avg_logFC pct.1 pct.2    p_val_adj
## Clec3b-23561 1.088436e-46  2.888685 0.964 0.145 1.650396e-42
## Dpep1-22170  2.483023e-38  2.905899 1.000 0.350 3.765008e-34
## Dpt-996      3.900997e-38  2.550161 1.000 0.519 5.915082e-34
## Gsn-11656    5.056213e-38  2.941448 1.000 0.995 7.666735e-34
## Ly6c2-7445   7.344865e-38  2.494238 1.000 0.458 1.113702e-33
## C3-9726      2.406291e-37  2.930319 1.000 0.603 3.648658e-33
## Find genes associated with time in cluster 0.
## Note: this is not the recommended approach to do this since instead of individual mice being treated as replicates, the cells are being treated as such
cluster0_data <- subset(x=data, subset=(seurat_clusters==0))
MarkersRes0 <- FindMarkers(cluster0_data, ident.1 = "5 day", group.by = "Time", assay = "SCT",slot = "data", test.use = "wilcox", return.thresh = 1e-6)
VlnPlot(cluster0_data, features = "Tac1-17705", group.by = "Time")

VlnPlot(cluster0_data, features = "Gm10116-5887", group.by = "Time")

VlnPlot(cluster0_data, features = "Hspa1l-9322", group.by = "Time")

# pb@assays@data[[1]][row.names(pb@assays@data[[1]]) == "Tac1-17705",]
# MarkersRes0[row.names(MarkersRes0)=="Tac1-17705",]
# pb@assays@data[[1]][row.names(pb@assays@data[[1]]) == "Gm10116-5887",]
# temp_ClusterResult <- tbl[[1]]
# temp_topClusterResult <- temp_ClusterResult[temp_ClusterResult$p_adj.loc < 1, ]
# temp_topClusterResult <- temp_topClusterResult[order(temp_topClusterResult$p_adj.loc),]
# temp_topClusterResult <- data.frame(Gene=row.names(temp_topClusterResult), temp_topClusterResult)
# temp_topClusterResult[row.names(temp_topClusterResult)=="Gm10116-5887", ]
# 
# pb@assays@data[[1]][row.names(pb@assays@data[[1]]) == "Hspa1l-9322",]
# temp_ClusterResult <- tbl[[1]]
# temp_topClusterResult <- temp_ClusterResult[temp_ClusterResult$p_adj.loc < 1, ]
# temp_topClusterResult <- temp_topClusterResult[order(temp_topClusterResult$p_adj.loc),]
# temp_topClusterResult <- data.frame(Gene=row.names(temp_topClusterResult), temp_topClusterResult)
# temp_topClusterResult[row.names(temp_topClusterResult)=="Hspa1l-9322", ]


## Find genes associated with time in cluster 1.
## Note: this is not the recommended approach to do this since instead of individual mice being treated as replicates, the cells are being treated as such
cluster1_data <- subset(x=data, subset=(seurat_clusters==1))
MarkersRes1 <- FindMarkers(cluster1_data, ident.1 = "5 day", group.by = "Time", assay = "SCT",slot = "data", test.use = "wilcox", return.thresh = 1e-6)
VlnPlot(cluster1_data, features = "Il1rl1-185", group.by = "Time")

Multi-sample multi-condition comparison

We will aggregate the counts across cells from each mouse within each cluster so that now we will be able to perform a pseudo-bulk RNA-seq differential expression separtely within each cluster. Note, we will be able to treat individual mice as replicates with this analyses. For the analyses in this section, we are moving away from Seurat. Hence we need to create a new single-cell RNA-seq object that the typical bioconductor package will recognise.

Within -cluster comparison

The bioconductor package muscat will help us with this analysis. I had to slightly modify the primary function in this package (pbDS) to work with this data. The modified version (that you have downloaded) is what I sourced at the start of this document. I illustrate below how to set up the design matrix to perform the differential expression analyses. I prefer this approach as opposed to typical approach to assuming a two condition comparison. This way you have a lot of flexibility in modeling more complex design with more than one variable along with interactions between variables of interest.

## Add modified names for the Time and Individual variable to make them work "nice" with the subsequent analysis in this code
data[["sTime"]] <- (data@meta.data$Time) %>%
  as.character() %>%
    gsub(" ", "_", .) %>%
    make.names()
data[["sIndividual"]] <- (data@meta.data$Individual) %>%
  as.character() %>%
  paste0("Individual_",.)

## Store the meta-data for each cell in the PhenoData object
PhenoData <- data@meta.data

## For this analysis we are moving away from Seurat. Hence we need to create a new single-cell RNA-seq object that the typical bioconductor package will recognise.
## Create SingleCellExperiment object
sce <- SummarizedExperiment(assays=list(counts=data@assays$RNA@counts, logcounts=data@assays$RNA@data), colData=PhenoData)
sce <- as(sce, "SingleCellExperiment")

## Prep this object for subsequent aggregation analyses
(sce <- prepSCE(sce, 
                kid = "seurat_clusters", # subpopulation assignments
                gid = "sTime",  # group IDs
                sid = "sIndividual",   # sample IDs 
                drop = TRUE))  # drop all other colData columns
## class: SingleCellExperiment 
## dim: 16148 297 
## metadata(1): experiment_info
## assays(2): counts logcounts
## rownames(16148): Sox17-4 Mrpl15-5 ... Erdr1-24861 Gm21748-24862
## rowData names(0):
## colnames(297): 22028_3_121 22028_3_123 ... 22028_6_97 22028_6_99
## colData names(3): cluster_id sample_id group_id
## reducedDimNames(0):
## altExpNames(0):
nk <- length(kids <- levels(sce$cluster_id))
ns <- length(sids <- levels(sce$sample_id))
names(kids) <- kids; names(sids) <- sids

##keep those clusters with a median of at least 5 cells across all individual replicates
##In a real data set you can change 5 to something more realistic (100?)
toKeep <- table(sce$cluster_id, sce$sample_id) %>% 
  t() %>% 
  apply(., 2, function(x) median(x)) %>% 
  subset(., is_greater_than(., 5)) %>% 
  names()
sce <- subset(sce, , cluster_id %in% toKeep)

## Aggregate counts across cells for each mouse (sample_id) within each cluster (cluster_id)
pb <- aggregateData(sce,
                    assay = "counts", fun = "sum",
                    by = c("cluster_id", "sample_id"))

## Visualize the results in a Multi-Dimensional Scaling (MDS) plot  
(pb_mds <- pbMDS(pb))
## Removing cluster-sample instance(s) '2'-'Individual_1197'

## Set up the design matrix. 
group_id <- colData(pb)[,1]
mm <- model.matrix(~ group_id)
colnames(mm) <- levels(pb$group_id)
# run DS analysis; This is only two group/sample comparison, so we are interested in testing the significance of the seco nd coefficient (that represent the log FC of gene expression between 5 day time-point and the 11 day time-point)

##Note, here I am using the modified version of the function included in the muscat package
res <- pbDS_update(pb, design=mm, coef=c(2),method="edgeR", verbose=TRUE, min_cells = 3)
## Loading required package: edgeR
## Loading required package: limma
## 
## Attaching package: 'limma'
## The following object is masked from 'package:BiocGenerics':
## 
##     plotMA
## 0..1..2..3..4..
## Cluster(s) "2" skipped due to an insufficient number of cells in at least 2 samples per group.
##The results are going to be placed in topClusterResults object
tbl <- res$table
ClusterNo <- 0
p_thresh <- 0.05
topClusterResults <- NULL
for(i in 1:length(tbl)) {
  temp_ClusterResult <- tbl[[i]]
  temp_topClusterResult <- temp_ClusterResult[temp_ClusterResult$p_adj.loc < p_thresh, ]
  temp_topClusterResult <- temp_topClusterResult[order(temp_topClusterResult$p_adj.loc),]
  temp_topClusterResult <- data.frame(Gene=row.names(temp_topClusterResult), temp_topClusterResult)
  topClusterResults <- rbind(topClusterResults, temp_topClusterResult)
}
head(topClusterResults)
##                      Gene cluster_id      logFC    logCPM        F        p_val
## Tac1-17705     Tac1-17705          0  13.335304  5.499646 76.03117 1.735143e-06
## Il1rl1-185     Il1rl1-185          1   3.163877  7.971941 77.44047 4.234654e-06
## Slc40a1-216   Slc40a1-216          1  12.195461  4.481199 88.36110 4.867283e-06
## Ccl20-469       Ccl20-469          1  -9.728543  4.846693 78.12111 4.068238e-06
## Actg2-18270   Actg2-18270          1 -10.090996  4.412121 88.05311 2.342635e-06
## Akr1c18-5209 Akr1c18-5209          4  16.497612 11.043998 43.53791 4.235817e-11
##                 p_adj.loc
## Tac1-17705   2.614513e-02
## Il1rl1-185   1.824623e-02
## Slc40a1-216  1.824623e-02
## Ccl20-469    1.824623e-02
## Actg2-18270  1.824623e-02
## Akr1c18-5209 5.671336e-07

Between -cluster comparison

For this analyses, we will use the lme4 package in R to fit generalized linear mixed effects models. We are going the model the change in the chance (or more formally the odds) of cells from a given mouse belonging to a given cluster from the 5 day to the 11 day time-point. The random effects part of these models captures the inherent correlation between the cells coming from the same mouse

##the number of cells in each mouse in each cluster
Ncells <- as.data.frame(metadata(pb)$n_cells)
print(Ncells)
##    cluster_id       sample_id Freq
## 1           0 Individual_1197   27
## 2           1 Individual_1197   30
## 3           2 Individual_1197    0
## 4           3 Individual_1197    5
## 5           4 Individual_1197    9
## 6           0 Individual_1200   37
## 7           1 Individual_1200   11
## 8           2 Individual_1200    2
## 9           3 Individual_1200    3
## 10          4 Individual_1200   15
## 11          0 Individual_1235    3
## 12          1 Individual_1235   10
## 13          2 Individual_1235   34
## 14          3 Individual_1235   26
## 15          4 Individual_1235    7
## 16          0 Individual_1242   16
## 17          1 Individual_1242   23
## 18          2 Individual_1242   18
## 19          3 Individual_1242   14
## 20          4 Individual_1242    7
##determine the total number of cells per mouse
Ncells <- Ncells %>%
  filter(Freq > 0)
short_Ncells <- Ncells %>% 
  spread(sample_id, Freq)

TotalCells <- colSums(short_Ncells[,-1], na.rm = T)
names(TotalCells) <- colnames(short_Ncells)[-1]

##load the cluster info and the meta-data per mouse
Clusters <- unique(Ncells$cluster_id)
SampleInfo <- colData(pb)

##function to estimate the change in the odds of cluster membership from the 5 day to the 11 day time-point
estimateCellStateChange <- function(k, Ncells, TotalCells, SampleInfo) {
  require(lme4)
  require(gdata)
  print(paste("Cluster", k))
  Ncells_sub <- Ncells %>%
    filter(cluster_id==k)
  FitData <- NULL
  TempIndices <- match(Ncells_sub$sample_id, names(TotalCells))
  TotalCells_sub <- TotalCells[TempIndices]
  for(i in 1:length(TotalCells_sub)) {
    InCluster=c(rep(1, Ncells_sub$Freq[i]), rep(0, (TotalCells_sub[i]-Ncells_sub$Freq[i])))
    Time=rep(SampleInfo$group_id, TotalCells_sub[i])
    ID=rep(row.names(SampleInfo)[i], TotalCells_sub[i])
    TempData <- data.frame(ID, InCluster, Time)
    FitData <- rbind(FitData, TempData)
  }
  FitData$Time <- relevel(FitData$Time, ref="X5_day")
  FitData$InCluster <- as.factor(FitData$InCluster)
  glmerFit1 <- glmer(InCluster ~ (1|ID) + Time, data=FitData, family = "binomial")
  sglmerFit1 <- summary(glmerFit1)
  TempRes1 <- (sglmerFit1$coefficients[-1,])


  return(TempRes1)
}

ClusterRes <- sapply(Clusters, estimateCellStateChange, Ncells, TotalCells, SampleInfo)
## Loading required package: lme4
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## The following object is masked from 'package:S4Vectors':
## 
##     expand
## Loading required package: gdata
## Registered S3 method overwritten by 'gdata':
##   method         from  
##   reorder.factor gplots
## gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.
## 
## gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.
## 
## Attaching package: 'gdata'
## The following objects are masked from 'package:dplyr':
## 
##     combine, first, last
## The following object is masked from 'package:purrr':
## 
##     keep
## The following object is masked from 'package:SummarizedExperiment':
## 
##     trim
## The following object is masked from 'package:Biobase':
## 
##     combine
## The following object is masked from 'package:GenomicRanges':
## 
##     trim
## The following object is masked from 'package:IRanges':
## 
##     trim
## The following objects are masked from 'package:S4Vectors':
## 
##     first, first<-
## The following object is masked from 'package:BiocGenerics':
## 
##     combine
## The following object is masked from 'package:stats4':
## 
##     nobs
## The following object is masked from 'package:stats':
## 
##     nobs
## The following object is masked from 'package:utils':
## 
##     object.size
## The following object is masked from 'package:base':
## 
##     startsWith
## [1] "Cluster 0"
## [1] "Cluster 1"
## [1] "Cluster 3"
## [1] "Cluster 4"
## [1] "Cluster 2"
ClusterRes %<>% 
  as.data.frame() %>% 
  t() 
row.names(ClusterRes) <-  paste0("Cluster", Clusters)
ClusterRes <- data.frame(ClusterRes)
colnames(ClusterRes)[c(1,4)] <- c("logOddsRatio_11day_vs_5day","pvalue")

##perform multiple-testing correction
ClusterRes <- data.frame(ClusterRes, p.adjust= p.adjust(ClusterRes$pvalue, method = "BH"))

##output the results
print(ClusterRes)
##          logOddsRatio_11day_vs_5day Std..Error    z.value    pvalue  p.adjust
## Cluster0                -0.08111937  0.1422270 -0.5703515 0.5684394 0.5684394
## Cluster1                -0.11611040  0.1390019 -0.8353154 0.4035401 0.5684394
## Cluster3                -0.16381580  0.1649122 -0.9933519 0.3205385 0.5684394
## Cluster4                -0.12341755  0.1752712 -0.7041517 0.4813383 0.5684394
## Cluster2                -0.22635984  0.1681710 -1.3460100 0.1782993 0.5684394

Batch correction

Below, we will go over code to remove potential batch effects in the data using the IntegrateData function in Seurat 3. We will use the same data as before, assuming that the data was generated in two batches. It was not, :). But lets pretend anyway.

## Let us visualize the uncorrected data
DimPlot(data, reduction = "tsne", group.by = "Individual")

DimPlot(data, reduction = "tsne", group.by = "Time")

DimPlot(data, reduction = "tsne", group.by = "batch")

## Split the cells between the two batches
batch.list <- SplitObject(data, split.by = "batch")

## Normalize the read counts across cells within the same bench
for (i in 1:length(batch.list)) {
  batch.list[[i]] <- SCTransform(batch.list[[i]], verbose = FALSE, method="qpoisson", vars.to.regress = NULL)
}

## Select features to identify anchors
batch.features <- SelectIntegrationFeatures(object.list = batch.list, nfeatures = 3000)
batch.list <- PrepSCTIntegration(object.list = batch.list, anchor.features = batch.features, 
                                    verbose = FALSE)
## Find the anchors
batch.anchors <- FindIntegrationAnchors(object.list = batch.list, normalization.method = "SCT", 
                                           anchor.features = batch.features, verbose = FALSE, k.filter  = 10)
## Warning: UNRELIABLE VALUE: Future ('future_lapply-1') unexpectedly generated
## random numbers without specifying argument '[future.]seed'. There is a risk that
## those random numbers are not statistically sound and the overall results might
## be invalid. To fix this, specify argument '[future.]seed', e.g. 'seed=TRUE'.
## This ensures that proper, parallel-safe random numbers are produced via the
## L'Ecuyer-CMRG method. To disable this check, use [future].seed=NULL, or set
## option 'future.rng.onMisuse' to "ignore".
## Integrate the data
batch.integrated <- IntegrateData(anchorset = batch.anchors, normalization.method = "SCT", 
                                     verbose = FALSE)
## Warning: Adding a command log without an assay associated with it
## Let us visualize the "corrected" data
batch.integrated <- RunPCA(batch.integrated, verbose = FALSE)
batch.integrated <- RunTSNE(batch.integrated, dims = 1:30, verbose = FALSE)

batch.integrated <- FindNeighbors(batch.integrated, dims = 1:30, verbose = FALSE)
batch.integrated <- FindClusters(batch.integrated, verbose = FALSE)
DimPlot(batch.integrated, label = TRUE, reduction = "tsne") 

DimPlot(batch.integrated, reduction = "tsne", group.by = "Individual")

DimPlot(batch.integrated, reduction = "tsne", group.by = "Time")

DimPlot(batch.integrated, reduction = "tsne", group.by = "batch")