single cell RNA seq workflow

Analyzing gene expression in 416B (an immortalized mouse myeloid progenitor cell line) cells

A constant amount of spike-in RNA from the External RNA Controls Consortium (ERCC) was also added to each cell’s lysate prior to library preparation. High-throughput sequencing was performed and the expression of each gene was quantified by counting the total number of reads mapped to its exonic regions. Similarly, the quantity of each spike-in transcript was measured by counting the number of reads mapped to the spike-in reference sequences.

Before the analysis, data from respective repository has to be identified, downloaded and loaded into the R in environment.

For this analyis, we obtain data from EMBL with the accession number E-MTAB-5522. The BiocFileCache is a Bioconductor package that makes it easy to download and unzip folders from repositories.

library(BiocFileCache)

  bfc <- BiocFileCache("raw_data", ask = FALSE)
  lun.zip <- bfcrpath(bfc, 
                      file.path("https://www.ebi.ac.uk/arrayexpress/files",
                                "E-MTAB-5522/E-MTAB-5522.processed.1.zip"))
  lun.sdrf <- bfcrpath(bfc, 
                       file.path("https://www.ebi.ac.uk/arrayexpress/files",
                                 "E-MTAB-5522/E-MTAB-5522.sdrf.txt"))
  unzip(lun.zip, exdir=tempdir())

Our data comes in two batches or plates. We need to prepare a count matrix for the two batches and set up the data ready for downstream analysis. Data preparation is a crutial step in any analysis. Inspecting the data, we can see that the first colum of each batch is a variabe with gene length. This is not useful for the anlysis but important in creating metadata for the count matrix. Therefore, this variable has to be removed ffrom the count data. The dataset is then stored as matrix for use in creating single cell experiment object for the Scater package. Individual batches are then merged into one count matrix ; named all.counts in this case.

plate1 <- read.delim(file.path(tempdir(), "counts_Calero_20160113.tsv"), 
                       header=TRUE, row.names=1, check.names=FALSE)
  
plate2 <- read.delim(file.path(tempdir(), "counts_Calero_20160325.tsv"), 
                       header=TRUE, row.names=1, check.names=FALSE)

gene.lengths <- plate1$Length # First row is the gene lenght
plate1 <- as.matrix(plate1[,-1]) #discard the gene leghth as it is not a cell
plate2 <- as.matrix(plate2[,-1])
rbind(plate1 = dim(plate1), plate2 = dim(plate2))

##         [,1] [,2]
## plate1 46703   96
## plate2 46703   96

stopifnot(identical(rownames(plate1), rownames(plate2)))
all.counts <- cbind(plate1, plate2)

The count matrix is stored in a single cell object for further analysis and other information stored as metadat in the same object.

library(SingleCellExperiment)

sce <- SingleCellExperiment(list(counts = all.counts))
rowData(sce)$GeneLength <- gene.lengths

Identify rows corresponding to ERCC spike-in transcripts and store them in the sce object

isSpike(sce, "ERCC") <- grepl("^ERCC", rownames(sce))
summary(isSpike(sce, "ERCC"))

##    Mode   FALSE    TRUE 
## logical   46611      92

This dataset is slightly unusual in that it contains information from another set of spike-in transcripts, the Spike-In RNA Variants (SIRV) set. For simplicity, we will only use the ERCC spike-ins in this analysis. Thus, we must remove the rows corresponding to the SIRV transcripts prior to further analysis, which can be done simply by subsetting the SingleCellExperiment object.

is.sirv <- grepl("^SIRV", rownames(sce))
sce <- sce[!is.sirv, ]
summary(is.sirv)

##    Mode   FALSE    TRUE 
## logical   46696       7

We then have to annotate all the cell with their respective metadata. The row names of the metadata must match the column names of the counts matrix. Otherwise we will have lots of erros and the analysis will not work.

metadata <- read.delim(lun.sdrf, check.names = FALSE, header = TRUE)
m <- match(colnames(sce), metadata[["Source Name"]]) # Enforcing identical order
stopifnot(all(!is.na(m))) #checking that nothing is misssing 
metadata <- metadata[m, ]
head(colnames(metadata))

## [1] "Source Name"                "Comment[ENA_SAMPLE]"       
## [3] "Comment[BioSD_SAMPLE]"      "Characteristics[organism]" 
## [5] "Characteristics[cell line]" "Characteristics[cell type]"

We need to only retain relevant metadata fields to avoid storing unnecessary information in the colData of the sce object. In this case, we will keep the plate of origin (i.e., block) and phenotype of each cell. The second field is relevant as all of the cells contain a CBFB-MYH11 oncogene, but the expression of this oncogene is only induced in a subset of the cells.

colData(sce)$Plate <- factor(metadata[["Factor Value[block]"]])
pheno <- metadata[["Factor Value[phenotype]"]]
levels(pheno) <- c("induced", "control")
colData(sce)$Oncogene <- pheno
table(colData(sce)$Oncogene, colData(sce)$Plate)

##          
##           20160113 20160325
##   induced       48       48
##   control       48       48

Next is the incorporation of gene anotation to the dataset. Feature-counting tools typically report genes in terms of standard identifiers from Ensembl or Entrez. These identifiers are used as they are unambiguous and highly stable. Given the Ensembl identifiers, we obtain the corresponding gene symbols using annotation packages like org.Mm.eg.db for mouse or org.Hm.eg.dp

library(org.Mm.eg.db)

symb <- mapIds(org.Mm.eg.db, keys=rownames(sce), keytype="ENSEMBL", column="SYMBOL")

## 'select()' returned 1:many mapping between keys and columns

rowData(sce)$ENSEMBL <- rownames(sce)
rowData(sce)$SYMBOL <- symb
head(rowData(sce))

## DataFrame with 6 rows and 3 columns
##                    GeneLength            ENSEMBL      SYMBOL
##                     <integer>        <character> <character>
## ENSMUSG00000102693       1070 ENSMUSG00000102693          NA
## ENSMUSG00000064842        110 ENSMUSG00000064842          NA
## ENSMUSG00000051951       6094 ENSMUSG00000051951        Xkr4
## ENSMUSG00000102851        480 ENSMUSG00000102851          NA
## ENSMUSG00000103377       2819 ENSMUSG00000103377          NA
## ENSMUSG00000104017       2233 ENSMUSG00000104017          NA

It is desirable to rename the row names of sce to the gene symbols, as these are easier to interpret. However, this requires some work to account for missing and duplicate symbols. The code below will replace missing symbols with the Ensembl identifier and concatenate duplicated symbols with the (unique) Ensembl identifiers. We also determine the chromosomal location for each gene using the TxDb.Mmusculus.UCSC.mm10.ensGene package. This will be useful later as several quality control metrics will be computed from rows corresponding to mitochondrial genes.

library(scater)

rownames(sce) <- uniquifyFeatureNames(rowData(sce)$ENSEMBL, rowData(sce)$SYMBOL)
head(rownames(sce), n = 10)

##  [1] "ENSMUSG00000102693" "ENSMUSG00000064842" "Xkr4"              
##  [4] "ENSMUSG00000102851" "ENSMUSG00000103377" "ENSMUSG00000104017"
##  [7] "ENSMUSG00000103025" "ENSMUSG00000089699" "ENSMUSG00000103201"
## [10] "ENSMUSG00000103147"

library(TxDb.Mmusculus.UCSC.mm10.ensGene)

location <- mapIds(TxDb.Mmusculus.UCSC.mm10.ensGene, keys = rowData(sce)$ENSEMBL, 
                   column = "CDSCHROM", keytype = "GENEID")

## 'select()' returned 1:1 mapping between keys and columns

rowData(sce)$CHR <- location
summary(location == "chrM")

##    Mode   FALSE    TRUE    NA's 
## logical   22428      13   24255

QUALITY CONTROL OF THE CELLS

QC is an important aspect during single cell nalayis. Here we shall: *** Define QC metrics *** Remove low quality cells *** Ensure technical effects dont distort the downstream analyis Note that library size, spike ins, mitochondrial RNA are part of qc process. For each cell, we calculate these quality control metrics using the calculateQCMetrics. These are stored in the row- and column-wise metadata of the SingleCellExperiment for future reference.

mito <- which(rowData(sce)$CHR == "chrM")
sce <- calculateQCMetrics(sce, feature_controls = list(mito))
head(colnames(colData(sce)), 10)

##  [1] "Plate"                          "Oncogene"                      
##  [3] "is_cell_control"                "total_features_by_counts"      
##  [5] "log10_total_features_by_counts" "total_counts"                  
##  [7] "log10_total_counts"             "pct_counts_in_top_50_features" 
##  [9] "pct_counts_in_top_100_features" "pct_counts_in_top_200_features"

The distributions of these metrics are shown and stratified by oncogene induction status and plate of origin. The aim is to remove putative low-quality cells that have low library sizes, low numbers of expressed features, and high spike-in (or mitochondrial) proportions. Such cells can interfere with downstream analyses, e.g., by forming distinct clusters that complicate interpretation of the results.

sce$PlateOnco <- paste0(sce$Oncogene, ".", sce$Plate)
multiplot(
  plotColData(sce, y="total_counts", x="PlateOnco"),
  plotColData(sce, y="total_features_by_counts", x="PlateOnco"),
  plotColData(sce, y="pct_counts_ERCC", x="PlateOnco"),
  cols=2)

#Examine how the QC metrics behave with respect to each other 
par(mfrow = c(1, 2))
plot(sce$total_features_by_counts, sce$total_counts/1e6, 
     xlab = "Number of Expressed Genes",
     ylab = "Library Size (millions)")
plot(sce$total_features_by_counts, sce$pct_counts_ERCC, 
     xlab = "Number of Expressed Genes",
     ylab = "ERCC proportion (%)")

Identify outliers for each quality metric

To obtain an adaptive threshold, we assume that most of the dataset consists of high-quality cells, and identify cells that are outliers for the various QC metrics. Outliers are defined based on the median absolute deviation (MADs) from the median value of each metric across all cells. Remove cells with log-library sizes that are more than 3 MADs below the median log-library size. A log-transformation improves resolution at small values, especially when the MAD of the raw values is comparable to or greater than the median Also remove cells where the log-transformed number of expressed genes is 3 MADs below the median value

libsize.drop <- isOutlier(sce$total_counts, nmads = 3, type = "lower",
                          log = TRUE, batch = sce$PlateOnco)

feature.drop <- isOutlier(sce$total_features_by_counts, nmads = 3, type = "lower",
                          log = TRUE, batch = sce$PlateOnco)

The batch= argument ensures that outliers are identified within each level of the specified plate/oncogene factor. Failing to account for these systematic differences would inflate the MAD estimate and compromise the removal of low-quality cells. We do not need to use the mitochondrial proportions as we already have the spike-in proportions (which serve a similar purpose) for this dataset This avoids potential issues arising from genuine differences in mitochondrial content between cell types that may confound outlier identification.

spike.drop <- isOutlier(sce$pct_counts_ERCC, nmads = 3, type = "higher",
                        batch = sce$PlateOnco)

Subsetting by column will retain only the high-quality cells that pass each filter described above. Examine the number of cells removed by each filter as well as the total number of retained cells. Removal of a substantial proportion of cells (> 10%) may be indicative of an overall issue with data quality.

keep <- !(libsize.drop | feature.drop | spike.drop)
data.frame(ByLibSize = sum(libsize.drop), ByFeature = sum(feature.drop), BySpike = sum(spike.drop), Remaining = sum(keep))

##   ByLibSize ByFeature BySpike Remaining
## 1         5         4       6       183

Subset the SingleCellExperiment object to retain only the putative high-quality cells. We also save the original object to file and save the original object to file for later use.

sce$PassQC <- keep
saveRDS(sce, file = "416B_preQC.rds")
sce <- sce[, keep]
dim(sce)

## [1] 46696   183

# checking whether the automatically selected thresholds are appropriate.
attr(libsize.drop, "thresholds")

##        control.20160113 control.20160325 induced.20160113 induced.20160325
## lower            674264         419448.6         499131.8         461501.1
## higher              Inf              Inf              Inf              Inf

We can classify the cells accordin to their cell cycle stages based on the expression data.

Using a training dataset, the sign of the difference in expression between two genes was computed for each pair of genes. Pairs with changes in the sign across cell cycle phases are chosen as markers. Cells in a test dataset can then be classified into the appropriate phase, based on whether the observed sign for each marker pair is consistent with one phase or another. This approach is implemented in the cyclone function from the scran package. The package contains a pre-trained set of marker pairs for mouse data, which we can load in the the readRDS function. We use the Ensembl identifiers for each gene in our dataset to match up with the names in the pre-trained set of gene pairs.

Cells are classified as being in G1 phase if the G1 score is above 0.5 and greater than the G2/M score; in G2/M phase if the G2/M score is above 0.5 and greater than the G1 score; and in S phase if neither score is above 0.5.

Save these assignments into the SingleCellExperiment object for later use.

set.seed(100)
library(scran)
mm.pairs <- readRDS(system.file("exdata", "mouse_cycle_markers.rds", 
                                package = "scran"))
assignments <- cyclone(sce, mm.pairs, gene.names = rowData(sce)$ENSEMBL)

#plot the cyclone result for each cell
plot(assignments$scores$G1, assignments$scores$G2M, xlab = "G1 score", ylab = "G2/M score", pch = 16)

sce$phases <- assignments$phases
table(sce$phases)

## 
##  G1 G2M   S 
##  98  62  23

Pre-trained classifiers are available in scran for human and mouse data. While the mouse classifier used here was trained on data from embryonic stem cells, it is still accurate for other cell types

Examination of gene level expression metrics by inspecting the most highly expresed genes

fontsize <- theme(axis.text = element_text(size = 12), axis.title = element_text(size = 16))
plotHighestExprs(sce, n  = 50) + fontsize

We also need to filter out the low abundace of zeros in the data set

Several metrics can be used to define low-abundance genes. The most obvious is the average count for each gene, computed across all cells in the dataset. Calculate this using the calcAverage() function, which also performs some adjustment for library size differences between cells. We typically observe a peak of moderately expressed genes following a plateau of lowly expressed genes

ave.counts <- calcAverage(sce, use_size_factors = FALSE)
hist(log10(ave.counts), breaks = 100, 
     main = "",
     col = "grey80",
     xlab = expression(Log[10]~"average count"))

A minimum threshold can be applied to this value to filter out genes that are lowly expressed

demo.keep <- ave.counts >= 1
filtered.sce <- sce[demo.keep, ]
summary(demo.keep)

##    Mode   FALSE    TRUE 
## logical   33490   13206

Examine the number of cells that express each gene. This is closely related to the average count for most genes, as expression in many cells will result in a higher average Genes expressed in very few cells are often uninteresting as they are driven by amplification artifacts (though they may also also arise from rare populations). We could then remove genes that are expressed in fewer than n cells.

num.cells <- nexprs(sce, byrow = TRUE)
smoothScatter(log10(ave.counts), num.cells, 
              ylab = "Number of cells",
              xlab = expression(Log[10]~"average  count"))

Apply these filters at each step rather than applying them globally by subsetting sce. This ensures that the most appropriate filter is used in each application.Nonetheless, we remove genes that are not expressed in any cell to reduce computational work in downstream steps. Such genes provide no information and would be removed by any filtering strategy.

to.keep <- num.cells > 0
sce <- sce[to.keep, ]
summary(to.keep)

##    Mode   FALSE    TRUE 
## logical   22833   23863

Normalization of cell specific biases

*** Using deconvolution method to deal with zero counts *** Normalization is required to eliminate these cell-specific biases prior to downstream quantitative analyses. This is often done by assuming that most genes are not differentially expressed (DE) between cells. Any systematic difference in count size across the non-DE majority of genes between two cells is assumed to represent bias and is removed by scaling *** Size factors can be computed with several different approaches, e.g., using the estimateSizeFactorsFromMatrix function in the DESeq2 package or with the calcNormFactors function (Robinson and Oshlack 2010) in the edgeR package However, single-cell data can be problematic for these bulk data-based methods due to the dominance of low and zero counts. To overcome this, we pool counts from many cells to increase the size of the counts for accurate size factor estimation Pool-based size factors are then “deconvolved” into cell-based factors for normalization of each cell’s expression profile.

sce <- computeSumFactors(sce)
summary(sizeFactors(sce))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3464  0.7117  0.9098  1.0000  1.1396  3.5695

The size factors are well-correlated with the library sizes for all cells. This suggests that most of the systematic differences between cells are driven by differences in capture efficiency or sequencing depth

Size factors from deconvolution, plotted against library sizes for all cells in the dataset. Axes are shown on a log-scale. Wild-type cells are shown in black and oncogene-induced cells are shown in red.

plot(sce$total_counts/1e6, sizeFactors(sce), log = "xy",
     xlab = "Library size (millions)",
     ylab = "Size factor", 
     col = c("red", "black")[sce$Oncogene],
     pch = 16)
legend("bottomright", col = c("red", "black"),
       pch = 16, cex = 1.2, 
       legend = levels(sce$Oncogene))

Note: Cell-based QC should always be performed prior to normalization, to remove cells with very low numbers of expressed genes. Otherwise, computeSumFactors() may yield negative size factors for low-quality cells. This is because too many zeroes are present in these cells, reducing the effectiveness of pooling to eliminate zeroes

Compute separate size factor for spike-in transcripts

NOTE: Attempting to normalize the spike-in counts with the gene-based size factors will lead to over-normalization and incorrect quantification of expression. These size factors are stored in a separate field of the SingleCellExperiment object by setting general.use=FALSE in computeSpikeFactors. This ensures that they will only be used with the spike-in transcripts but not the endogenous genes.

sce <- computeSpikeFactors(sce, type = "ERCC", general.use = FALSE)

*** Apply the size factors to normalize gene expression *** The count data are used to compute normalized log-expression values for use in downstream analyses *** Each value is defined as the log2-ratio of each count to the size factor for the corresponding cell, after adding a prior count of 1 to avoid undefined values at zero counts. *** Division by the size factor ensures that any cell-specific biases are removed. *** If spike-in-specific size factors are present in sce, they will be automatically applied to normalize the spike-in transcripts separately from the endogenous genes.

sce <- normalize(sce)

The log-transformation is useful as it means that any differences in the values directly represent log2-fold changes in expression between cells

MODEL THE TECHNICAL NOISE IN GENE EXPRESSION

Variability in the observed expression values across genes can be driven by genuine biological heterogeneity or uninteresting technical noise.Use the trendVar() function to fit a mean-dependent trend to the variances of the log-expression values for the spike-in transcripts. Set block= to block on the plate of origin for each cell, to ensure that technical differences between plates do not inflate the variances. The biological component of the variance is finally calculated by subtracting the technical component from the total variance of each gene with the decomposeVar function.

var.fit <- trendVar(sce, parametric = TRUE, 
                    block = sce$Plate,
                    loess.args = list(span = 0.3))
var.out <- decomposeVar(sce, var.fit)
head(var.out)

## DataFrame with 6 rows and 6 columns
##                                   mean               total
##                              <numeric>           <numeric>
## ENSMUSG00000103377 0.00807160215928894   0.011921865486065
## ENSMUSG00000103147  0.0346526072192529  0.0722196162535234
## ENSMUSG00000103161 0.00519472222570747 0.00493857699521053
## ENSMUSG00000102331   0.018666093059853   0.032923591860573
## ENSMUSG00000102948   0.059057000132083  0.0881371257735823
## Rp1                 0.0970243712569606    0.45233813529556
##                                    bio               tech
##                              <numeric>          <numeric>
## ENSMUSG00000103377 -0.0238255786157388 0.0357474441018038
## ENSMUSG00000103147  -0.081268086087932  0.153487702341455
## ENSMUSG00000103161 -0.0180705438766402 0.0230091208718507
## ENSMUSG00000102331 -0.0497487337224493 0.0826723255830223
## ENSMUSG00000102948  -0.173441452746905  0.261578578520487
## Rp1                 0.0226096722084225  0.429728463087137
##                               p.value               FDR
##                             <numeric>         <numeric>
## ENSMUSG00000103377                  1                 1
## ENSMUSG00000103147  0.999999999992144                 1
## ENSMUSG00000103161                  1                 1
## ENSMUSG00000102331  0.999999999998056                 1
## ENSMUSG00000102948                  1                 1
## Rp1                0.0354980967632601 0.153727755415776

#visually inspect the trend to confirm that it corresponds to the spike-in variances

plot(var.out$mean, var.out$total, pch = 16, cex = 0.6, 
     xlab = "Mean log- expression",
     ylab = "Variance of log-expression")
curve(var.fit$trend(x), col = "dodgerblue", lwd=2, add = TRUE)
cur.spike <- isSpike(sce)
points(var.out$mean[cur.spike], var.out$total[cur.spike], col = "red", pch = 16)

# Check the distribution of expression values for genes with the largest biological components 
chosen.genes <- order(var.out$bio, decreasing = TRUE)[1:10]

Violin plots of normalized log-expression values for the top 10 genes with the largest biological components in the dataset. Each point represents the log-expression value in a single cell.

plotExpression(sce, features = rownames(var.out)[chosen.genes]) + fontsize

REMOVING BATCH EFFECTS

The data were collected on two plates. Small uncontrollable differences in processing between plates can result in a batch effect, i.e., systematic differences in expression between cells on different plates. Such differences are not interesting and can be removed by applying the removeBatchEffect() function from the limma package

library(limma)

## 
## Attaching package: 'limma'

## The following object is masked from 'package:scater':
## 
##     plotMDS

## The following object is masked from 'package:BiocGenerics':
## 
##     plotMA

assay(sce, "corrected") <- removeBatchEffect(logcounts(sce), 
                                             design = model.matrix(~sce$Oncogene),
                                             batch = sce$Plate)
assayNames(sce)

## [1] "counts"    "logcounts" "corrected"

Manual batch correction is necessary for downstream procedures that are not model-based, e.g., clustering and most forms of dimensionality reduction. However, if an analysis method can accept a design matrix, blocking on nuisance factors in the design matrix is preferable to using removeBatchEffect(). removeBatchEffect() performs a linear regression and sets the coefficients corresponding to the blocking factors to zero.

DENOISE EXPRESSION VALUES USING PCA

Once the technical noise is modelled, we can use principal components analysis (PCA) to remove random technical noise. The denoisePCA() function removes later PCs until the total discarded variance is equal to the sum of technical components for all genes used in the PCA.

sce <- denoisePCA(sce, technical = var.out, assay.type = "corrected")
dim(reducedDim(sce, "PCA"))

## [1] 183  24

The function returns a SingleCellExperiment object containing the PC scores for each cell in the reducedDims slot. The aim is to eliminate technical noise and enrich for biological signal in the retained PCs. This improves resolution of the underlying biology during downstream procedures such as clustering.

sce2 <- denoisePCA(sce, technical = var.fit$trend, 
                   assay.type = "corrected", 
                   value = "lowrank")
assayNames(sce2)

## [1] "counts"    "logcounts" "corrected" "lowrank"

VISUALIZE DATA IN LOW DIMENSIONAL SPACE

With PCA

Visualize the relationships between cells by constructing pairwise PCA plots for the first three components Cells with similar expression profiles should be located close together in the plot, while dissimilar cells should be far apart. We observe a clear separation of cells based on the oncogene induction status, consistent with the expected effects on the transcriptome.

plotReducedDim(sce, use_dimred = "PCA",
               ncomponents = 3,
               colour_by = "Oncogene") + fontsize

#Each point represents a cell, coloured according to oncogene induction status. 

# plot PCA based on batch to see whether the batch correction step was succesful 
plotReducedDim(sce, use_dimred = "PCA",
               ncomponents = 3,
               colour_by = "Plate") + fontsize

Note that plotReducedDim() will use the PCA results that were already stored in sce by denoisePCA(). This allows us to rapidly generate new plots with different aesthetics, without repeating the entire PCA computation. Similarly, plotPCA() will use existing results if they are available in the SingleCellExperiment, and will recalculate them otherwise. Users should set rerun=TRUE to forcibly recalculate the PCs in the presence of existing results. For each visualization method, additional cell-specific information can be incorporated into the size or shape of each point. This is done using the size_by= and shape_by= arguments in most plotting functions.

With t-SNE = t-stochastic neighbour embedding

t-SNE tends to work better than PCA for separating cells in more diverse populations.The former can directly capture non-linear relationships in high-dimensional space, whereas the latter must represent them on linear axes.

set.seed(100)
out5 <- plotTSNE(sce, run_args = list(use_dimred = "PCA",
                                      perplexity = 5),
                 colour_by = "Oncogene") + fontsize + ggtitle("Perplexity = 5")
set.seed(100)
out10 <- plotTSNE(sce, run_args = list(use_dimred = "PCA",
                                       perplexity = 10),
                  colour_by = "Oncogene") + fontsize + ggtitle("Perplexity = 10")
set.seed(100)
out20 <- plotTSNE(sce, run_args = list(use_dimred = "PCA",
                                       perplexity = 20),
                  colour_by = "Oncogene") + fontsize + ggtitle("Perplexity = 20")
multiplot(out5, out10, out20, cols = 3)

# Note: t-SNE is a stochastic method, so users should run the algorithm several times to ensure that the results are representative. 

#Here, we call runTSNE() with a perplexity of 20 to store the t-SNE results inside our SingleCellExperiment object. This avoids repeating the calculations whenever we want to create a new plot with plotTSNE(), as the stored results will be used instead.
# We can set rerun=TRUE to force recalculation in the presence of stored results.

set.seed(100)
sce <- runTSNE(sce, use_dimred = "PCA", perplexity = 20)
reducedDimNames(sce)

## [1] "PCA"  "TSNE"

CLUSTER CELLS INTO PUTATIVE SUBPOPULATIONS

*** Define cell clusterd form expression data *** The denoised log-expression values are used to cluster cells into putative subpopulations *** Perform hierarchical clustering on the Euclidean distances between cells, using Ward’s criterion to minimize the total variance within each cluster.

pcs <- reducedDim(sce, "PCA")
my.dist <- dist(pcs)
my.tree <- hclust(my.dist, method = "ward.D2")

library(dynamicTreeCut)
my.clusters <- unname(cutreeDynamic(my.tree, distM = as.matrix(my.dist),
                                    minClusterSize = 10, verbose = 0))
# Examine distribution of cells in each cluster with repect to known factors 
table(my.clusters, sce$Plate)

##            
## my.clusters 20160113 20160325
##           1       41       39
##           2       19       20
##           3       16       11
##           4       10       14
##           5        5        8

table(my.clusters, sce$Oncogene)

##            
## my.clusters induced control
##           1      80       0
##           2       0      39
##           3       0      27
##           4       0      24
##           5      13       0

#Visualize the cluster assignemnts for all genes on the t-SNE plot
sce$cluster <- factor(my.clusters)
plotTSNE(sce, colour_by = "cluster") + fontsize

Each point represents a cell and is coloured according to the cluster identity to which it was assigned.

Check the separatedness of the clusters using the silhouette width Cells with large positive silhouette widths are closer to other cells in the same cluster than to cells in different clusters. Conversely, cells with negative widths are closer to other clusters than to other cells in the cluster to which it was assigned.

library(cluster)
clust.col <- scater:::.get_palette("tableau10medium") # hidden scatter colors
sil <- silhouette(my.clusters, dist = my.dist)
sil.cols <- clust.col[ifelse(sil[,3] > 0, sil[, 1], sil[, 2])]
sil.cols <- sil.cols[order(-sil[, 1], sil[,3])]
plot(sil, main = paste(length(unique(my.clusters)), "clusters"), 
     border = sil.cols, col = sil.cols, do.col.sort = FALSE)

Above is a Barplot of silhouette widths for cells in each cluster. Each cluster is assigned a colour and cells with positive widths are coloured according to the colour of its assigned cluster. Any cell with a negative width is coloured according to the colour of the cluster that it is closest to. The average width for all cells in each cluster is shown, along with the average width for all cells in the dataset.

The silhouette width can be used to determine the parameter values that maximize the separation between clusters. For example, we could vary the cut height or splitting depth in cutreeDynamic to maximize the average silhouette width across all cells. This usually provides a satisfactory initial clustering for further examination.

DETECTING MARKER GENES BETWEEN CLUSTERS

Once putative subpopulations are identified by clustering, we can identify marker genes for each cluster using the findMarkers function This performs Welch t-tests on the log-expression values for every gene and between every pair of clusters

markers <- findMarkers(sce, my.clusters, block = sce$Plate)

For each cluster, the DE results of the relevant comparisons are consolidated into a single output table. This allows a set of marker genes to be easily defined by taking the top DE genes from each pairwise comparison between clusters. For example, to construct a marker set for cluster 1 from the top 10 genes of each comparison, one would filter marker.set to retain rows with Top less than or equal to 10.

marker.set <- markers[["1"]]
head(marker.set, 10)

## DataFrame with 10 rows and 7 columns
##              Top              p.value                  FDR
##        <integer>            <numeric>            <numeric>
## Aurkb          1 6.65863261824492e-75 1.58362259559721e-70
## Tk1            1 6.41442387689833e-64 3.81385607660686e-60
## Myh11          1 4.95086465211539e-49 9.81220116843835e-46
## Cdca8          1 2.22334940769686e-46  3.5251945975503e-43
## Ccna2          2 1.16841222330534e-68 1.38941739534356e-64
## Rrm2           2 1.48873842020081e-56 5.05809512109088e-53
## Cks1b          2 3.83636139889977e-39 2.40105745131667e-36
## Pirb           2 1.83893803490065e-34 6.15992440620318e-32
## Pimreg         3 7.41004737119548e-68 5.87443855430467e-64
## Pclaf          3  8.9651722100101e-51 2.13218690670669e-47
##                  logFC.2           logFC.3            logFC.4
##                <numeric>         <numeric>          <numeric>
## Aurkb  -7.37163350569914 -6.72799345321135  -1.95039440976238
## Tk1    -4.92754579848056 -7.74065113926215  -3.53749565362853
## Myh11   4.42159318376489  4.30812918035861   4.45235717737968
## Cdca8  -6.84273526334783 -4.88595092732349  -2.43821402084038
## Ccna2   -7.3079756458424  -6.9676852052539  -2.46589325823089
## Rrm2   -5.52120322191947 -7.94685699818751  -3.19173143688883
## Cks1b   -6.6792118199827 -5.92137181826573  -4.37146346518812
## Pirb    5.25803749166673  5.18195596569259   5.87631057587633
## Pimreg -7.30454210816126 -5.91099762335684 -0.874660676141792
## Pclaf  -5.60087985621813 -7.56997893312346  -2.36631043435985
##                   logFC.5
##                 <numeric>
## Aurkb   -6.91128802096519
## Tk1     -4.63516273649457
## Myh11     1.0413149433198
## Cdca8   -7.12791471326961
## Ccna2   -7.12692721843331
## Rrm2    -5.42878091964762
## Cks1b     -6.214592473138
## Pirb   0.0704964218555272
## Pimreg  -7.01798853404623
## Pclaf   -5.16956927698937

head(markers$`1`)

## DataFrame with 6 rows and 7 columns
##             Top              p.value                  FDR
##       <integer>            <numeric>            <numeric>
## Aurkb         1 6.65863261824492e-75 1.58362259559721e-70
## Tk1           1 6.41442387689833e-64 3.81385607660686e-60
## Myh11         1 4.95086465211539e-49 9.81220116843835e-46
## Cdca8         1 2.22334940769686e-46  3.5251945975503e-43
## Ccna2         2 1.16841222330534e-68 1.38941739534356e-64
## Rrm2          2 1.48873842020081e-56 5.05809512109088e-53
##                 logFC.2           logFC.3           logFC.4
##               <numeric>         <numeric>         <numeric>
## Aurkb -7.37163350569914 -6.72799345321135 -1.95039440976238
## Tk1   -4.92754579848056 -7.74065113926215 -3.53749565362853
## Myh11  4.42159318376489  4.30812918035861  4.45235717737968
## Cdca8 -6.84273526334783 -4.88595092732349 -2.43821402084038
## Ccna2  -7.3079756458424  -6.9676852052539 -2.46589325823089
## Rrm2  -5.52120322191947 -7.94685699818751 -3.19173143688883
##                 logFC.5
##               <numeric>
## Aurkb -6.91128802096519
## Tk1   -4.63516273649457
## Myh11   1.0413149433198
## Cdca8 -7.12791471326961
## Ccna2 -7.12692721843331
## Rrm2  -5.42878091964762

Save the list of candidate marker genes for further examination.

write.table(marker.set, file = "416B_marker_1.tsv", sep = "\t", quote = FALSE, col.names = NA)

Visualize the expression profiles of the top candidates to verify that the DE signature is robust

Most of the top markers have strong and consistent up- or downregulation in cells of cluster 1 compared to some or all of the other clusters. A cursory examination of the heatmap indicates that cluster 1 contains oncogene-induced cells with strong downregulation of DNA replication and cell cycle genes. This is consistent with the potential induction of senescence as an anti-tumorigenic response

top.markers <- rownames(marker.set)[marker.set$Top <=10]
plotHeatmap(sce, features = top.markers, columns = order(sce$cluster), 
            colour_columns_by = c("cluster", "Plate", "Oncogene"), 
            cluster_cols = FALSE, 
            center = TRUE, 
            symmetric = TRUE,
            zlim = c(-5, 5))

saveRDS(sce, file = "416B_final_data.rds")