Note that the echo = FALSE parameter was added to the
code chunk to prevent printing of the R code that generated the
plot.
sce.seger <- SegerstolpePancreasData()
## snapshotDate(): 2022-10-31
## see ?scRNAseq and browseVignettes('scRNAseq') for documentation
## loading from cache
## see ?scRNAseq and browseVignettes('scRNAseq') for documentation
## loading from cache
## see ?scRNAseq and browseVignettes('scRNAseq') for documentation
## loading from cache
## snapshotDate(): 2022-10-31
## see ?scRNAseq and browseVignettes('scRNAseq') for documentation
## loading from cache
edb <- AnnotationHub()[["AH73881"]]
## snapshotDate(): 2022-10-31
## loading from cache
## require("ensembldb")
symbols <- rowData(sce.seger)$symbol
ens.id <- mapIds(edb, keys=symbols, keytype="SYMBOL", column="GENEID")
## Warning: Unable to map 3305 of 26179 requested IDs.
ens.id <- ifelse(is.na(ens.id), symbols, ens.id)
length(ens.id)
## [1] 26179
# Removing duplicated rows.
keep <- !duplicated(ens.id)
length(keep)
## [1] 26179
sce.seger <- sce.seger[keep,]
rownames(sce.seger) <- ens.id[keep]
length(sce.seger)
## [1] 25454
emtab.meta <- colData(sce.seger)[,c("cell type", "disease",
"individual", "single cell well quality")]
colnames(emtab.meta) <- c("CellType", "Disease", "Donor", "Quality")
colData(sce.seger) <- emtab.meta
sce.seger$CellType <- gsub(" cell", "", sce.seger$CellType)
sce.seger$CellType <- paste0(
toupper(substr(sce.seger$CellType, 1, 1)),
substring(sce.seger$CellType, 2))
library(scater)
## 载入需要的程辑包:scuttle
## 载入需要的程辑包:ggplot2
unfiltered <- sce.seger
low.qual <- sce.seger$Quality == "low quality cell"
#dim(low.qual)
The scRNA-seq data is stored in the data class with various labels and datasets. The colData and rowData are frequently used. The result of scRNA-seq is an expression matrix with columns as cell labels and rows as gene names. Each column represents the sequencing information of a single-cell library.
For each cell, we calculate these QC metrics using the perCellQCMetrics() function from the scater package (McCarthy et al. 2017). The sum column contains the total count for each cell , the detected column contains the number of detected genes and the altexps_ERCC_percent column contains the percentage of reads mapped to ERCC transcripts(External RNA Controls Consortium). For spike-in external reference transcripts, since the initial amount added to each cell is the same, if the spike-in expression in certain cells is abnormally increased compared to other cells, it may indicate a relatively lower expression of endogenous genes in those cells. Here, we set percent_subsets as “altexps_ERCC_percent”.
We also remove low quality cells that were marked by the authors. We then perform additional quality control as some of the remaining cells still have very low counts and numbers of detected features. For some batches that seem to have a majority of low-quality cells, we use the other batches to define an appropriate threshold via subset=.
In total, there are 1246 cells are discard.
The figure shows the distribution of each QC metric across cells from each donor of the Segerstolpe pancreas dataset. Each point represents a cell and is colored according to whether that cell was discarded.
stats <- perCellQCMetrics(sce.seger)
stats_feature <- perFeatureQCMetrics(stats)
summary(stats$sum)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22 6304 173243 312631 417967 5167219
summary(stats$detected)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18 1262 4936 4491 7049 12196
summary(stats$altexps_ERCC_percent)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.4095 0.7433 14.4664 3.1393 99.1578
#sum column contains the total count for each cell and the detected column contains the number of detected genes.
qc <- quickPerCellQC(stats, percent_subsets="altexps_ERCC_percent",
batch=sce.seger$Donor,
subset=!sce.seger$Donor %in% c("HP1504901", "HP1509101"))
sce.seger1 <- sce.seger[,!(qc$discard | low.qual)]
#length(sce.seger)
colData(unfiltered) <- cbind(colData(unfiltered), stats)
unfiltered$discard <- qc$discard
gridExtra::grid.arrange(
plotColData(unfiltered, x="Donor", y="sum", colour_by="discard") +
scale_y_log10() + ggtitle("Total count") +
theme(axis.text.x = element_text(angle = 90)),
plotColData(unfiltered, x="Donor", y="detected", colour_by="discard") +
scale_y_log10() + ggtitle("Detected features") +
theme(axis.text.x = element_text(angle = 90)),
plotColData(unfiltered, x="Donor", y="altexps_ERCC_percent",
colour_by="discard") + ggtitle("ERCC percent") +
theme(axis.text.x = element_text(angle = 90)),
ncol=2
)
colSums(as.matrix(qc))
## low_lib_size low_n_features high_altexps_ERCC_percent
## 788 1056 1031
## discard
## 1246
#rowData(sce.seger1)
#colData(sce.seger1)
They usually stem from technical variations in cDNA capture or PCR amplification efficiency among cells, stemming from challenges in maintaining uniform library preparation with limited starting material. Normalization seeks to eliminate these variations, preventing them from affecting comparisons between cell expression profiles. This guarantees that any identified diversity or differential expression within the cell population originates from biology, not technical biases. In simple words, Normalization is aimed at eliminating differences in cell library sizes.
The assumption here is that any cell-specific bias (e.g., in capture or amplification efficiency) affects all genes equally via scaling of the expected mean count for that cell. The size factor for each cell represents the estimate of the relative bias in that cell, so division of its counts by its size factor should remove that bias.
We should notice that Batch effects not only imply more pronounced technical errors between batches but also additional biological-level errors stemming from varying physiological states of sampled cells across different batches, often unpredictable. Thus, a different approach from standardization is needed to correct for these effects which we will mention later.
In this code,we opt to deconvolution method and we don’t normalize the spike-ins at this point as there are some cells with no spike-in counts. The figure illustrate the relationship between the library size factors and the deconvolution size factors in the Segerstolpe pancreas dataset which indicates the normaliztion works well.
library(scran)
clusters <- quickCluster(sce.seger1)
sce.seger1 <- computeSumFactors(sce.seger1, clusters=clusters)
sce.seger <- logNormCounts(sce.seger1)
summary(sizeFactors(sce.seger))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01406 0.39040 0.70823 1.00000 1.33182 11.18150
plot(librarySizeFactors(sce.seger), sizeFactors(sce.seger), pch=16,
xlab="Library size factors", ylab="Deconvolution factors", log="xy")
In this step, We want to select genes that contain useful information about the biology of the system while removing genes that contain random noise. This aims to preserve interesting biological structure without the variance that obscures that structure, and to reduce the size of the data to improve computational efficiency of later steps.
Calculation of the per-gene variance is simple but feature selection requires modelling of the mean-variance relationship. The log-transformation is not a variance stabilizing transformation in most cases, which means that the total variance of a gene is driven more by its abundance than its underlying biological heterogeneity. The premise here is that spike-ins should not be affected by biological variation, so the fitted value of the spike-in trend should represent a better estimate of the technical component for each gene. Here, we choose 2000 high variance genes and store them in chosen.hvgs.
The figure shows that per-gene variance as a function of the mean for the log-expression values in the Grun pancreas dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the spike-in transcripts (red) separately for each donor.
for.hvg <- sce.seger[,librarySizeFactors(altExp(sce.seger)) > 0 & sce.seger$Donor!="AZ"]
dec.seger <- modelGeneVarWithSpikes(for.hvg, "ERCC", block=for.hvg$Donor)
chosen.hvgs <- getTopHVGs(dec.seger, n=2000)
par(mfrow=c(3,3))
blocked.stats <- dec.seger$per.block
for (i in colnames(blocked.stats)) {
current <- blocked.stats[[i]]
plot(current$mean, current$total, main=i, pch=16, cex=0.5,
xlab="Mean of log-expression", ylab="Variance of log-expression")
curfit <- metadata(current)
points(curfit$mean, curfit$var, col="red", pch=16)
curve(curfit$trend(x), col='dodgerblue', add=TRUE, lwd=2)
}
# Dimensionality reduction Dimensionality reduction aims to reduce the
number of separate dimensions in the data. This is possible because
different genes are correlated if they are affected by the same
biological process. Thus, we do not need to store separate information
for individual genes, but can instead compress multiple features into a
single dimension.
In this step, we make use of PCA, TSNE and UMAP. TSNE attempts to find a low-dimensional representation of the data that preserves the distances between each point and its neighbors in the high-dimensional space. Unlike PCA, it is not restricted to linear transformations, nor is it obliged to accurately represent distances between distant populations. We compare the three methods and use TSNE for downsteam analysis. We pick first 25 components, not just it is a nice square number, but it can explain most variance as shown in the graph.
library(BiocSingular)
set.seed(101011001)
sce.seger2 <- runPCA(sce.seger, subset_row=chosen.hvgs, ncomponents=25)
sce.seger_old <- runTSNE(sce.seger2, dimred="PCA")
library(scran)
#top.zeisel <- getTopHVGs(dec.zeisel, n=2000)
sce.seger_PCA <- fixedPCA(sce.seger, subset.row=chosen.hvgs)
reducedDimNames(sce.seger_PCA )
## [1] "PCA"
percent.var <- attr(reducedDim(sce.seger_PCA), "percentVar")
plot(percent.var, log="y", xlab="PC", ylab="Variance explained (%)")
library(scater)
plotReducedDim(sce.seger_PCA, dimred="PCA", colour_by="CellType")
plotReducedDim(sce.seger_PCA, dimred="PCA", colour_by="Donor")
plotReducedDim(sce.seger_old, dimred="TSNE", colour_by="CellType")
plotReducedDim(sce.seger_old, dimred="TSNE", colour_by="Donor")
#plotReducedDim(sce.seger, dimred="TSNE", colour_by="label")
set.seed(1100101001)
sce.seger <- runUMAP(sce.seger_old, dimred="PCA")
plotReducedDim(sce.seger, dimred="UMAP", colour_by="CellType")
plotReducedDim(sce.seger, dimred="UMAP", colour_by="Donor")
After annotation based on marker genes, the clusters can be treated as proxies for more abstract biological concepts such as cell types or states.
From the heatmap, we can observe some batch effects. For example, the cell from HP150850172D belongs to cluster 3 frequently. HP1507101 seems to belong to cluster 16. The ideal heat map is uniform distribution on all the cluster.
library(bluster)
clust.out <- clusterRows(reducedDim(sce.seger, "PCA"), NNGraphParam(k=30), full=TRUE)
snn.gr <- clust.out$objects$graph
colLabels(sce.seger) <- clust.out$clusters
#colData(sce.seger)
tab <- table(Cluster=colLabels(sce.seger), Donor=sce.seger$Donor)
library(pheatmap)
## Warning: 程辑包'pheatmap'是用R版本4.2.3 来建造的
pheatmap(log10(tab+10), color=viridis::viridis(100))
gridExtra::grid.arrange(
plotTSNE(sce.seger, colour_by="label"),
plotTSNE(sce.seger, colour_by="Donor"),
# plotTSNE(sce.seger, colour_by="CellType"),
ncol=2
)
gridExtra::grid.arrange(
plotUMAP(sce.seger, colour_by="label"),
# plotUMAP(sce.seger, colour_by="CellType"),
plotUMAP(sce.seger, colour_by="Donor"),
ncol=2
)
library(batchelor)
set.seed(10001010)
corrected <- fastMNN(sce.seger, batch=sce.seger$Donor, subset.row=chosen.hvgs)
set.seed(10000001)
corrected <- runTSNE(corrected, dimred="corrected")
colLabels(corrected) <- clusterRows(reducedDim(corrected, "corrected"), NNGraphParam())
tab <- table(Cluster=colLabels(corrected), Donor=corrected$batch)
tab
## Donor
## Cluster AZ HP1502401 HP1504101T2D HP1504901 HP1506401 HP1507101 HP1508501T2D
## 1 3 19 3 11 67 8 78
## 2 14 53 13 19 37 41 20
## 3 2 2 1 1 44 1 1
## 4 2 18 7 3 36 2 28
## 5 29 114 140 72 26 136 121
## 6 8 21 9 6 2 6 6
## 7 1 1 1 9 0 1 2
## 8 2 1 3 10 2 6 12
## 9 4 20 70 8 16 2 8
## Donor
## Cluster HP1509101 HP1525301T2D HP1526901T2D
## 1 27 124 46
## 2 14 11 70
## 3 0 1 4
## 4 2 23 9
## 5 49 85 96
## 6 11 5 34
## 7 2 2 1
## 8 3 13 4
## 9 1 10 34
gridExtra::grid.arrange(
plotTSNE(corrected, colour_by="label"),
plotTSNE(corrected, colour_by="batch"),
ncol=2
)
corrected <- runUMAP(corrected, dimred="corrected")
gridExtra::grid.arrange(
plotUMAP(corrected, colour_by="label"),
plotUMAP(corrected, colour_by="batch"),
#plotUMAP(corrected, colour_by="CellType")
ncol=2)
We identify the genes that drive separation between clusters. These marker genes allow us to assign biological meaning to each cluster based on their functional annotation. The most straightforward approach to marker gene detection involves testing for differential expression between clusters. If a gene is strongly DE between clusters, it is likely to have driven the separation of cells in the clustering algorithm.
This particular dataset contains both healthy donors and those with type II diabetes. It is thus of some interest to identify genes that are differentially expressed upon disease in each cell type. To keep things simple, we use the author-provided annotation rather than determining the cell type for each of our clusters.
summed <- aggregateAcrossCells(sce.seger,
ids=colData(sce.seger)[,c("Donor", "CellType")])
summed
## class: SingleCellExperiment
## dim: 25454 105
## metadata(0):
## assays(1): counts
## rownames(25454): ENSG00000118473 ENSG00000142920 ... ENSG00000278306
## eGFP
## rowData names(2): symbol refseq
## colnames: NULL
## colData names(9): CellType Disease ... CellType ncells
## reducedDimNames(3): PCA TSNE UMAP
## mainExpName: endogenous
## altExpNames(0):
summed.beta <- summed[,summed$CellType=="Beta"]
library(edgeR)
## 载入需要的程辑包:limma
##
## 载入程辑包:'limma'
## The following object is masked from 'package:scater':
##
## plotMDS
## The following object is masked from 'package:BiocGenerics':
##
## plotMA
##
## 载入程辑包:'edgeR'
## The following object is masked from 'package:SingleCellExperiment':
##
## cpm
y.beta <- DGEList(counts(summed.beta), samples=colData(summed.beta),
genes=rowData(summed.beta)[,"symbol",drop=FALSE])
y.beta <- y.beta[filterByExpr(y.beta, group=y.beta$samples$Disease),]
y.beta <- calcNormFactors(y.beta)
design <- model.matrix(~Disease, y.beta$samples)
v.beta <- voomWithQualityWeights(y.beta, design)
fit.beta <- lmFit(v.beta)
fit.beta <- eBayes(fit.beta, robust=TRUE)
res.beta <- topTable(fit.beta, sort.by="p", n=Inf,
coef="Diseasetype II diabetes mellitus")
head(res.beta)
## symbol logFC AveExpr t P.Value adj.P.Val
## ENSG00000254647 INS -2.727594 16.679963 -7.670828 3.190592e-06 0.03902413
## ENSG00000137731 FXYD2 -2.595174 7.265291 -6.704722 1.344005e-05 0.08219262
## ENSG00000169297 NR0B1 -2.091911 6.789805 -5.788725 5.810439e-05 0.09916061
## ENSG00000181029 TRAPPC5 -2.127058 7.045811 -5.677974 7.006610e-05 0.09916061
## ENSG00000105707 HPN -1.802989 6.118320 -5.653960 7.298496e-05 0.09916061
## LOC284889 LOC284889 -2.113079 6.651917 -5.514882 9.259279e-05 0.09916061
## B
## ENSG00000254647 4.841774
## ENSG00000137731 3.352639
## ENSG00000169297 1.984063
## ENSG00000181029 1.877000
## ENSG00000105707 1.740442
## LOC284889 1.570683
# Easier to just re-run it with plot=TRUE than
# to try to make the plot from 'v.beta'.
voomWithQualityWeights(y.beta, design, plot=TRUE)
## An object of class "EList"
## $genes
## symbol
## ENSG00000118473 SGIP1
## ENSG00000142920 AZIN2
## ENSG00000169504 CLIC4
## ENSG00000186094 AGBL4
## ENSG00000157191 NECAP2
## 12226 more rows ...
##
## $targets
## group lib.size norm.factors CellType Disease
## Sample1 1 8381688 1.0381082 Beta normal
## Sample2 1 38093062 0.9407632 Beta normal
## Sample3 1 8629679 1.0100997 Beta type II diabetes mellitus
## Sample4 1 5160583 1.0152408 Beta normal
## Sample5 1 9773920 0.9285212 Beta normal
## Sample6 1 3980716 0.8488683 Beta normal
## Sample7 1 3477722 1.0949382 Beta type II diabetes mellitus
## Sample8 1 1093719 0.7828879 Beta normal
## Sample9 1 1987070 1.2201832 Beta type II diabetes mellitus
## Sample10 1 21310826 1.2111486 Beta type II diabetes mellitus
## Donor Quality sizeFactor label Donor.1 CellType.1 ncells
## Sample1 AZ OK NA 8 AZ Beta 12
## Sample2 HP1502401 OK NA 8 HP1502401 Beta 48
## Sample3 HP1504101T2D OK NA <NA> HP1504101T2D Beta 10
## Sample4 HP1504901 OK NA 8 HP1504901 Beta 13
## Sample5 HP1506401 OK NA 1 HP1506401 Beta 32
## Sample6 HP1507101 OK NA <NA> HP1507101 Beta 34
## Sample7 HP1508501T2D OK NA <NA> HP1508501T2D Beta 13
## Sample8 HP1509101 OK NA <NA> HP1509101 Beta 10
## Sample9 HP1525301T2D OK NA 1 HP1525301T2D Beta 10
## Sample10 HP1526901T2D OK NA 1 HP1526901T2D Beta 64
## sample.weights
## Sample1 0.2182822
## Sample2 0.4036392
## Sample3 0.5261690
## Sample4 0.1675219
## Sample5 1.7511246
## Sample6 4.0857913
## Sample7 0.9553432
## Sample8 2.6052004
## Sample9 4.0071317
## Sample10 1.8045296
##
## $E
## Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
## ENSG00000118473 -4.067241 3.777831 -4.109307 -3.3675343 4.927808 2.364524
## ENSG00000142920 -4.067241 4.345665 4.025119 -3.3675343 4.414966 4.407851
## ENSG00000169504 5.582015 2.228324 -2.524345 3.0418567 2.930231 1.761859
## ENSG00000186094 -4.067241 3.966501 1.618613 4.3398249 3.111942 1.530534
## ENSG00000157191 5.873807 5.708183 3.556029 0.5393563 5.464279 5.269067
## Sample7 Sample8 Sample9 Sample10
## ENSG00000118473 4.669463 2.571196 2.532919 3.702829
## ENSG00000142920 4.361728 2.571196 2.653213 3.617152
## ENSG00000169504 -2.798143 2.958219 3.138640 4.524595
## ENSG00000186094 1.289320 2.040681 2.963553 3.290389
## ENSG00000157191 4.440262 6.795568 5.322240 5.247373
## 12226 more rows ...
##
## $weights
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.08535067 0.5464053 0.2020583 0.04916752 0.7606132 1.0606639 0.2292398
## [2,] 0.13030827 0.8052552 0.2726148 0.06911779 1.1919411 1.4251680 0.2756258
## [3,] 0.07695028 0.4787286 0.1740652 0.04543074 0.6806675 0.9915501 0.2096848
## [4,] 0.06408956 0.3622538 0.2114832 0.03968976 0.5589471 0.8806141 0.2354876
## [5,] 0.42796104 1.4915126 0.7954058 0.23617078 3.7495151 4.6645934 0.6684742
## [,8] [,9] [,10]
## [1,] 0.4274795 0.7868755 1.412363
## [2,] 0.5158695 0.9113598 2.039101
## [3,] 0.4059633 0.7251257 1.143386
## [4,] 0.3676914 0.8058561 1.502443
## [5,] 1.0276763 1.7868679 4.613125
## 12226 more rows ...
##
## $design
## (Intercept) Diseasetype II diabetes mellitus
## Sample1 1 0
## Sample2 1 0
## Sample3 1 1
## Sample4 1 0
## Sample5 1 0
## Sample6 1 0
## Sample7 1 1
## Sample8 1 0
## Sample9 1 1
## Sample10 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$Disease
## [1] "contr.treatment"
res.beta2 <- topTable(fit.beta, sort.by="P", n=Inf,
coef="Diseasetype II diabetes mellitus")
head(res.beta2)
## symbol logFC AveExpr t P.Value adj.P.Val
## ENSG00000254647 INS -2.727594 16.679963 -7.670828 3.190592e-06 0.03902413
## ENSG00000137731 FXYD2 -2.595174 7.265291 -6.704722 1.344005e-05 0.08219262
## ENSG00000169297 NR0B1 -2.091911 6.789805 -5.788725 5.810439e-05 0.09916061
## ENSG00000181029 TRAPPC5 -2.127058 7.045811 -5.677974 7.006610e-05 0.09916061
## ENSG00000105707 HPN -1.802989 6.118320 -5.653960 7.298496e-05 0.09916061
## LOC284889 LOC284889 -2.113079 6.651917 -5.514882 9.259279e-05 0.09916061
## B
## ENSG00000254647 4.841774
## ENSG00000137731 3.352639
## ENSG00000169297 1.984063
## ENSG00000181029 1.877000
## ENSG00000105707 1.740442
## LOC284889 1.570683