Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Data loading

sce.seger <- SegerstolpePancreasData()
## snapshotDate(): 2022-10-31
## see ?scRNAseq and browseVignettes('scRNAseq') for documentation
## loading from cache
## see ?scRNAseq and browseVignettes('scRNAseq') for documentation
## loading from cache
## see ?scRNAseq and browseVignettes('scRNAseq') for documentation
## loading from cache
## snapshotDate(): 2022-10-31
## see ?scRNAseq and browseVignettes('scRNAseq') for documentation
## loading from cache
edb <- AnnotationHub()[["AH73881"]]
## snapshotDate(): 2022-10-31
## loading from cache
## require("ensembldb")
symbols <- rowData(sce.seger)$symbol
ens.id <- mapIds(edb, keys=symbols, keytype="SYMBOL", column="GENEID")
## Warning: Unable to map 3305 of 26179 requested IDs.
ens.id <- ifelse(is.na(ens.id), symbols, ens.id)
length(ens.id)
## [1] 26179
# Removing duplicated rows.
keep <- !duplicated(ens.id)
length(keep)
## [1] 26179
sce.seger <- sce.seger[keep,]
rownames(sce.seger) <- ens.id[keep]
length(sce.seger)
## [1] 25454
emtab.meta <- colData(sce.seger)[,c("cell type", "disease",
                                    "individual", "single cell well quality")]
colnames(emtab.meta) <- c("CellType", "Disease", "Donor", "Quality")
colData(sce.seger) <- emtab.meta

sce.seger$CellType <- gsub(" cell", "", sce.seger$CellType)
sce.seger$CellType <- paste0(
  toupper(substr(sce.seger$CellType, 1, 1)),
  substring(sce.seger$CellType, 2))
library(scater)
## 载入需要的程辑包:scuttle
## 载入需要的程辑包:ggplot2
unfiltered <- sce.seger
low.qual <- sce.seger$Quality == "low quality cell"
#dim(low.qual)

Quality control and General Introduction

The scRNA-seq data is stored in the data class with various labels and datasets. The colData and rowData are frequently used. The result of scRNA-seq is an expression matrix with columns as cell labels and rows as gene names. Each column represents the sequencing information of a single-cell library.

For each cell, we calculate these QC metrics using the perCellQCMetrics() function from the scater package (McCarthy et al. 2017). The sum column contains the total count for each cell , the detected column contains the number of detected genes and the altexps_ERCC_percent column contains the percentage of reads mapped to ERCC transcripts(External RNA Controls Consortium). For spike-in external reference transcripts, since the initial amount added to each cell is the same, if the spike-in expression in certain cells is abnormally increased compared to other cells, it may indicate a relatively lower expression of endogenous genes in those cells. Here, we set percent_subsets as “altexps_ERCC_percent”.

We also remove low quality cells that were marked by the authors. We then perform additional quality control as some of the remaining cells still have very low counts and numbers of detected features. For some batches that seem to have a majority of low-quality cells, we use the other batches to define an appropriate threshold via subset=.

In total, there are 1246 cells are discard.

The figure shows the distribution of each QC metric across cells from each donor of the Segerstolpe pancreas dataset. Each point represents a cell and is colored according to whether that cell was discarded.

stats <- perCellQCMetrics(sce.seger)
stats_feature <- perFeatureQCMetrics(stats)
summary(stats$sum)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      22    6304  173243  312631  417967 5167219
summary(stats$detected)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      18    1262    4936    4491    7049   12196
summary(stats$altexps_ERCC_percent)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.4095  0.7433 14.4664  3.1393 99.1578
#sum column contains the total count for each cell and the detected column contains the number of detected genes.
qc <- quickPerCellQC(stats, percent_subsets="altexps_ERCC_percent",
                     batch=sce.seger$Donor,
                     subset=!sce.seger$Donor %in% c("HP1504901", "HP1509101"))

sce.seger1 <- sce.seger[,!(qc$discard | low.qual)]
#length(sce.seger)
colData(unfiltered) <- cbind(colData(unfiltered), stats)
unfiltered$discard <- qc$discard

gridExtra::grid.arrange(
  plotColData(unfiltered, x="Donor", y="sum", colour_by="discard") +
    scale_y_log10() + ggtitle("Total count") +
    theme(axis.text.x = element_text(angle = 90)),
  plotColData(unfiltered, x="Donor", y="detected", colour_by="discard") +
    scale_y_log10() + ggtitle("Detected features") +
    theme(axis.text.x = element_text(angle = 90)),
  plotColData(unfiltered, x="Donor", y="altexps_ERCC_percent",
              colour_by="discard") + ggtitle("ERCC percent") +
    theme(axis.text.x = element_text(angle = 90)),
  ncol=2
)

colSums(as.matrix(qc))
##              low_lib_size            low_n_features high_altexps_ERCC_percent 
##                       788                      1056                      1031 
##                   discard 
##                      1246
#rowData(sce.seger1)
#colData(sce.seger1)

Normalization

They usually stem from technical variations in cDNA capture or PCR amplification efficiency among cells, stemming from challenges in maintaining uniform library preparation with limited starting material. Normalization seeks to eliminate these variations, preventing them from affecting comparisons between cell expression profiles. This guarantees that any identified diversity or differential expression within the cell population originates from biology, not technical biases. In simple words, Normalization is aimed at eliminating differences in cell library sizes.

The assumption here is that any cell-specific bias (e.g., in capture or amplification efficiency) affects all genes equally via scaling of the expected mean count for that cell. The size factor for each cell represents the estimate of the relative bias in that cell, so division of its counts by its size factor should remove that bias.

We should notice that Batch effects not only imply more pronounced technical errors between batches but also additional biological-level errors stemming from varying physiological states of sampled cells across different batches, often unpredictable. Thus, a different approach from standardization is needed to correct for these effects which we will mention later.

In this code,we opt to deconvolution method and we don’t normalize the spike-ins at this point as there are some cells with no spike-in counts. The figure illustrate the relationship between the library size factors and the deconvolution size factors in the Segerstolpe pancreas dataset which indicates the normaliztion works well.

library(scran)
clusters <- quickCluster(sce.seger1)
sce.seger1 <- computeSumFactors(sce.seger1, clusters=clusters)
sce.seger <- logNormCounts(sce.seger1) 
summary(sizeFactors(sce.seger))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.01406  0.39040  0.70823  1.00000  1.33182 11.18150
plot(librarySizeFactors(sce.seger), sizeFactors(sce.seger), pch=16,
    xlab="Library size factors", ylab="Deconvolution factors", log="xy")

Variance modelling (Feature selection)

In this step, We want to select genes that contain useful information about the biology of the system while removing genes that contain random noise. This aims to preserve interesting biological structure without the variance that obscures that structure, and to reduce the size of the data to improve computational efficiency of later steps.

Calculation of the per-gene variance is simple but feature selection requires modelling of the mean-variance relationship. The log-transformation is not a variance stabilizing transformation in most cases, which means that the total variance of a gene is driven more by its abundance than its underlying biological heterogeneity. The premise here is that spike-ins should not be affected by biological variation, so the fitted value of the spike-in trend should represent a better estimate of the technical component for each gene. Here, we choose 2000 high variance genes and store them in chosen.hvgs.

The figure shows that per-gene variance as a function of the mean for the log-expression values in the Grun pancreas dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the spike-in transcripts (red) separately for each donor.

for.hvg <- sce.seger[,librarySizeFactors(altExp(sce.seger)) > 0 & sce.seger$Donor!="AZ"]
dec.seger <- modelGeneVarWithSpikes(for.hvg, "ERCC", block=for.hvg$Donor)
chosen.hvgs <- getTopHVGs(dec.seger, n=2000)
par(mfrow=c(3,3))
blocked.stats <- dec.seger$per.block
for (i in colnames(blocked.stats)) {
    current <- blocked.stats[[i]]
    plot(current$mean, current$total, main=i, pch=16, cex=0.5,
        xlab="Mean of log-expression", ylab="Variance of log-expression")
    curfit <- metadata(current)
    points(curfit$mean, curfit$var, col="red", pch=16)
    curve(curfit$trend(x), col='dodgerblue', add=TRUE, lwd=2)
}

# Dimensionality reduction Dimensionality reduction aims to reduce the number of separate dimensions in the data. This is possible because different genes are correlated if they are affected by the same biological process. Thus, we do not need to store separate information for individual genes, but can instead compress multiple features into a single dimension.

In this step, we make use of PCA, TSNE and UMAP. TSNE attempts to find a low-dimensional representation of the data that preserves the distances between each point and its neighbors in the high-dimensional space. Unlike PCA, it is not restricted to linear transformations, nor is it obliged to accurately represent distances between distant populations. We compare the three methods and use TSNE for downsteam analysis. We pick first 25 components, not just it is a nice square number, but it can explain most variance as shown in the graph.

library(BiocSingular)
set.seed(101011001)
sce.seger2 <- runPCA(sce.seger, subset_row=chosen.hvgs, ncomponents=25)
sce.seger_old <- runTSNE(sce.seger2, dimred="PCA")
library(scran)
#top.zeisel <- getTopHVGs(dec.zeisel, n=2000)
sce.seger_PCA <- fixedPCA(sce.seger, subset.row=chosen.hvgs) 
reducedDimNames(sce.seger_PCA )
## [1] "PCA"
percent.var <- attr(reducedDim(sce.seger_PCA), "percentVar")
plot(percent.var, log="y", xlab="PC", ylab="Variance explained (%)")

library(scater)
plotReducedDim(sce.seger_PCA, dimred="PCA", colour_by="CellType")

plotReducedDim(sce.seger_PCA, dimred="PCA", colour_by="Donor")

plotReducedDim(sce.seger_old, dimred="TSNE", colour_by="CellType")

plotReducedDim(sce.seger_old, dimred="TSNE", colour_by="Donor")

#plotReducedDim(sce.seger, dimred="TSNE", colour_by="label")
set.seed(1100101001)
sce.seger <- runUMAP(sce.seger_old, dimred="PCA")

plotReducedDim(sce.seger, dimred="UMAP", colour_by="CellType")

plotReducedDim(sce.seger, dimred="UMAP", colour_by="Donor")

Cluster

After annotation based on marker genes, the clusters can be treated as proxies for more abstract biological concepts such as cell types or states.

From the heatmap, we can observe some batch effects. For example, the cell from HP150850172D belongs to cluster 3 frequently. HP1507101 seems to belong to cluster 16. The ideal heat map is uniform distribution on all the cluster.

library(bluster)
clust.out <- clusterRows(reducedDim(sce.seger, "PCA"), NNGraphParam(k=30), full=TRUE)
snn.gr <- clust.out$objects$graph
colLabels(sce.seger) <- clust.out$clusters
#colData(sce.seger)
tab <- table(Cluster=colLabels(sce.seger), Donor=sce.seger$Donor)
library(pheatmap)
## Warning: 程辑包'pheatmap'是用R版本4.2.3 来建造的
pheatmap(log10(tab+10), color=viridis::viridis(100))

gridExtra::grid.arrange(
    plotTSNE(sce.seger, colour_by="label"),
    plotTSNE(sce.seger, colour_by="Donor"),
   # plotTSNE(sce.seger, colour_by="CellType"),
    ncol=2
)

gridExtra::grid.arrange(
    plotUMAP(sce.seger, colour_by="label"),
   # plotUMAP(sce.seger, colour_by="CellType"),
    plotUMAP(sce.seger, colour_by="Donor"),
     ncol=2
)

Data integration

library(batchelor)

set.seed(10001010)
corrected <- fastMNN(sce.seger, batch=sce.seger$Donor, subset.row=chosen.hvgs)

set.seed(10000001)
corrected <- runTSNE(corrected, dimred="corrected")

colLabels(corrected) <- clusterRows(reducedDim(corrected, "corrected"), NNGraphParam())

tab <- table(Cluster=colLabels(corrected), Donor=corrected$batch)
tab
##        Donor
## Cluster  AZ HP1502401 HP1504101T2D HP1504901 HP1506401 HP1507101 HP1508501T2D
##       1   3        19            3        11        67         8           78
##       2  14        53           13        19        37        41           20
##       3   2         2            1         1        44         1            1
##       4   2        18            7         3        36         2           28
##       5  29       114          140        72        26       136          121
##       6   8        21            9         6         2         6            6
##       7   1         1            1         9         0         1            2
##       8   2         1            3        10         2         6           12
##       9   4        20           70         8        16         2            8
##        Donor
## Cluster HP1509101 HP1525301T2D HP1526901T2D
##       1        27          124           46
##       2        14           11           70
##       3         0            1            4
##       4         2           23            9
##       5        49           85           96
##       6        11            5           34
##       7         2            2            1
##       8         3           13            4
##       9         1           10           34
gridExtra::grid.arrange(
    plotTSNE(corrected, colour_by="label"),
    plotTSNE(corrected, colour_by="batch"),
    ncol=2
)

corrected <- runUMAP(corrected, dimred="corrected")
gridExtra::grid.arrange(
    plotUMAP(corrected, colour_by="label"),
    plotUMAP(corrected, colour_by="batch"),
  #plotUMAP(corrected, colour_by="CellType")
  ncol=2)

Multi-sample comparisons(Marker gene detection)

We identify the genes that drive separation between clusters. These marker genes allow us to assign biological meaning to each cluster based on their functional annotation. The most straightforward approach to marker gene detection involves testing for differential expression between clusters. If a gene is strongly DE between clusters, it is likely to have driven the separation of cells in the clustering algorithm.

This particular dataset contains both healthy donors and those with type II diabetes. It is thus of some interest to identify genes that are differentially expressed upon disease in each cell type. To keep things simple, we use the author-provided annotation rather than determining the cell type for each of our clusters.

summed <- aggregateAcrossCells(sce.seger, 
    ids=colData(sce.seger)[,c("Donor", "CellType")])
summed
## class: SingleCellExperiment 
## dim: 25454 105 
## metadata(0):
## assays(1): counts
## rownames(25454): ENSG00000118473 ENSG00000142920 ... ENSG00000278306
##   eGFP
## rowData names(2): symbol refseq
## colnames: NULL
## colData names(9): CellType Disease ... CellType ncells
## reducedDimNames(3): PCA TSNE UMAP
## mainExpName: endogenous
## altExpNames(0):
summed.beta <- summed[,summed$CellType=="Beta"]

library(edgeR)
## 载入需要的程辑包:limma
## 
## 载入程辑包:'limma'
## The following object is masked from 'package:scater':
## 
##     plotMDS
## The following object is masked from 'package:BiocGenerics':
## 
##     plotMA
## 
## 载入程辑包:'edgeR'
## The following object is masked from 'package:SingleCellExperiment':
## 
##     cpm
y.beta <- DGEList(counts(summed.beta), samples=colData(summed.beta),
    genes=rowData(summed.beta)[,"symbol",drop=FALSE])
y.beta <- y.beta[filterByExpr(y.beta, group=y.beta$samples$Disease),]
y.beta <- calcNormFactors(y.beta)

design <- model.matrix(~Disease, y.beta$samples)
v.beta <- voomWithQualityWeights(y.beta, design)
fit.beta <- lmFit(v.beta)
fit.beta <- eBayes(fit.beta, robust=TRUE)

res.beta <- topTable(fit.beta, sort.by="p", n=Inf,
    coef="Diseasetype II diabetes mellitus")
head(res.beta)
##                    symbol     logFC   AveExpr         t      P.Value  adj.P.Val
## ENSG00000254647       INS -2.727594 16.679963 -7.670828 3.190592e-06 0.03902413
## ENSG00000137731     FXYD2 -2.595174  7.265291 -6.704722 1.344005e-05 0.08219262
## ENSG00000169297     NR0B1 -2.091911  6.789805 -5.788725 5.810439e-05 0.09916061
## ENSG00000181029   TRAPPC5 -2.127058  7.045811 -5.677974 7.006610e-05 0.09916061
## ENSG00000105707       HPN -1.802989  6.118320 -5.653960 7.298496e-05 0.09916061
## LOC284889       LOC284889 -2.113079  6.651917 -5.514882 9.259279e-05 0.09916061
##                        B
## ENSG00000254647 4.841774
## ENSG00000137731 3.352639
## ENSG00000169297 1.984063
## ENSG00000181029 1.877000
## ENSG00000105707 1.740442
## LOC284889       1.570683
# Easier to just re-run it with plot=TRUE than
# to try to make the plot from 'v.beta'.
voomWithQualityWeights(y.beta, design, plot=TRUE)

## An object of class "EList"
## $genes
##                 symbol
## ENSG00000118473  SGIP1
## ENSG00000142920  AZIN2
## ENSG00000169504  CLIC4
## ENSG00000186094  AGBL4
## ENSG00000157191 NECAP2
## 12226 more rows ...
## 
## $targets
##          group lib.size norm.factors CellType                   Disease
## Sample1      1  8381688    1.0381082     Beta                    normal
## Sample2      1 38093062    0.9407632     Beta                    normal
## Sample3      1  8629679    1.0100997     Beta type II diabetes mellitus
## Sample4      1  5160583    1.0152408     Beta                    normal
## Sample5      1  9773920    0.9285212     Beta                    normal
## Sample6      1  3980716    0.8488683     Beta                    normal
## Sample7      1  3477722    1.0949382     Beta type II diabetes mellitus
## Sample8      1  1093719    0.7828879     Beta                    normal
## Sample9      1  1987070    1.2201832     Beta type II diabetes mellitus
## Sample10     1 21310826    1.2111486     Beta type II diabetes mellitus
##                 Donor Quality sizeFactor label      Donor.1 CellType.1 ncells
## Sample1            AZ      OK         NA     8           AZ       Beta     12
## Sample2     HP1502401      OK         NA     8    HP1502401       Beta     48
## Sample3  HP1504101T2D      OK         NA  <NA> HP1504101T2D       Beta     10
## Sample4     HP1504901      OK         NA     8    HP1504901       Beta     13
## Sample5     HP1506401      OK         NA     1    HP1506401       Beta     32
## Sample6     HP1507101      OK         NA  <NA>    HP1507101       Beta     34
## Sample7  HP1508501T2D      OK         NA  <NA> HP1508501T2D       Beta     13
## Sample8     HP1509101      OK         NA  <NA>    HP1509101       Beta     10
## Sample9  HP1525301T2D      OK         NA     1 HP1525301T2D       Beta     10
## Sample10 HP1526901T2D      OK         NA     1 HP1526901T2D       Beta     64
##          sample.weights
## Sample1       0.2182822
## Sample2       0.4036392
## Sample3       0.5261690
## Sample4       0.1675219
## Sample5       1.7511246
## Sample6       4.0857913
## Sample7       0.9553432
## Sample8       2.6052004
## Sample9       4.0071317
## Sample10      1.8045296
## 
## $E
##                   Sample1  Sample2   Sample3    Sample4  Sample5  Sample6
## ENSG00000118473 -4.067241 3.777831 -4.109307 -3.3675343 4.927808 2.364524
## ENSG00000142920 -4.067241 4.345665  4.025119 -3.3675343 4.414966 4.407851
## ENSG00000169504  5.582015 2.228324 -2.524345  3.0418567 2.930231 1.761859
## ENSG00000186094 -4.067241 3.966501  1.618613  4.3398249 3.111942 1.530534
## ENSG00000157191  5.873807 5.708183  3.556029  0.5393563 5.464279 5.269067
##                   Sample7  Sample8  Sample9 Sample10
## ENSG00000118473  4.669463 2.571196 2.532919 3.702829
## ENSG00000142920  4.361728 2.571196 2.653213 3.617152
## ENSG00000169504 -2.798143 2.958219 3.138640 4.524595
## ENSG00000186094  1.289320 2.040681 2.963553 3.290389
## ENSG00000157191  4.440262 6.795568 5.322240 5.247373
## 12226 more rows ...
## 
## $weights
##            [,1]      [,2]      [,3]       [,4]      [,5]      [,6]      [,7]
## [1,] 0.08535067 0.5464053 0.2020583 0.04916752 0.7606132 1.0606639 0.2292398
## [2,] 0.13030827 0.8052552 0.2726148 0.06911779 1.1919411 1.4251680 0.2756258
## [3,] 0.07695028 0.4787286 0.1740652 0.04543074 0.6806675 0.9915501 0.2096848
## [4,] 0.06408956 0.3622538 0.2114832 0.03968976 0.5589471 0.8806141 0.2354876
## [5,] 0.42796104 1.4915126 0.7954058 0.23617078 3.7495151 4.6645934 0.6684742
##           [,8]      [,9]    [,10]
## [1,] 0.4274795 0.7868755 1.412363
## [2,] 0.5158695 0.9113598 2.039101
## [3,] 0.4059633 0.7251257 1.143386
## [4,] 0.3676914 0.8058561 1.502443
## [5,] 1.0276763 1.7868679 4.613125
## 12226 more rows ...
## 
## $design
##          (Intercept) Diseasetype II diabetes mellitus
## Sample1            1                                0
## Sample2            1                                0
## Sample3            1                                1
## Sample4            1                                0
## Sample5            1                                0
## Sample6            1                                0
## Sample7            1                                1
## Sample8            1                                0
## Sample9            1                                1
## Sample10           1                                1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$Disease
## [1] "contr.treatment"
res.beta2 <- topTable(fit.beta, sort.by="P", n=Inf,
    coef="Diseasetype II diabetes mellitus")
head(res.beta2)
##                    symbol     logFC   AveExpr         t      P.Value  adj.P.Val
## ENSG00000254647       INS -2.727594 16.679963 -7.670828 3.190592e-06 0.03902413
## ENSG00000137731     FXYD2 -2.595174  7.265291 -6.704722 1.344005e-05 0.08219262
## ENSG00000169297     NR0B1 -2.091911  6.789805 -5.788725 5.810439e-05 0.09916061
## ENSG00000181029   TRAPPC5 -2.127058  7.045811 -5.677974 7.006610e-05 0.09916061
## ENSG00000105707       HPN -1.802989  6.118320 -5.653960 7.298496e-05 0.09916061
## LOC284889       LOC284889 -2.113079  6.651917 -5.514882 9.259279e-05 0.09916061
##                        B
## ENSG00000254647 4.841774
## ENSG00000137731 3.352639
## ENSG00000169297 1.984063
## ENSG00000181029 1.877000
## ENSG00000105707 1.740442
## LOC284889       1.570683