scRNA-seq for naive and memory T-cells classification

Introduction

Peripheral Blood Mononuclear Cells (PBMC) are a crucial component of the immune system, encompassing lymphocytes and monocytes. Derived from peripheral blood, PBMCs play a central role in immune response, acting as key players in adaptive and innate immunity. In this collaborative group work, we leverage freely available training scPBMC data from 10X Genomics, sourced from the Seurat tutorial datasets. This dataset encompasses 2,700 single cells sequenced on the Illumina NextSeq 500. The cells were annotated with the use of canonical marker genes. Our objective is to employ various clustering methods, both supervised and unsupervised, to categorize these cells into specific types, with a particular focus on distinguishing naive and memory T-cells.

Methods

Data Retrieval and Preprocessing

Raw single-cell RNA sequencing (scRNA-seq) data were obtained from the 10x Genomics PBMC3k dataset here. Upon data extraction, we performed preprocessing steps to address the inherent challenges of scRNA-seq data, characterized by sparsity and zero-inflation.

Cell Type Classification

Utilizing the Uniform Manifold Approximation and Projection (UMAP) reduction clustering method, distinct cell types were identified within the dataset. Particular attention was given to clusters representing naive and memory CD4 T cells, which were selected for downstream analyses.

Unsupervised Clustering

For unsupervised exploration of the scRNA-seq dataset, we conducted clustering analysis using Spearman correlation as a measure of similarity. The correlation matrix was computed using the cor function on the normalized counts obtained from the assay(normCounts). Subsequently, the pheatmap function was employed to generate a heatmap visualizing the sample-to-sample distances based on the Spearman correlation values. To enhance the interpretability of the heatmap, hierarchical clustering was performed on both rows (samples) and columns (samples) using the as.dist(1 - sampleDist) distance metric. This facilitated the identification of potential clusters and patterns within the dataset. The annotation_col parameter was utilized to incorporate cell type information as a color annotation, providing insights into the distribution of cell types across the heatmap. This unsupervised clustering approach complemented other methods such as PCA and heatmaps based on Pearson correlation, collectively contributing to a comprehensive understanding of the inherent structure and relationships within the scRNA-seq data.

Supervised Clustering (sPLS-DA)

We prepared the data, ensuring consistency in the order of samples between the combined expression table and metadata. Subsequently, we constructed data matrices X and Y, representing transposed gene expression data and corresponding cell types, respectively. Employing the sPLS-DA algorithm, we modeled the discriminatory patterns between cell types and visualized the outcomes through individual sample and variable plots. AUROC analysis was performed to assess the model’s ability to distinguish between cell types for each component. We further delved into the interpretation of the model by extracting variables, shedding light on the importance of individual genes in the discriminatory process. To refine the outcome of the model, we iterated the analysis by increasing the number of variables in each component, illustrating the impact of variable selection on discrimination performance. Additionally, we explored the sPLS-DA model using the entire set of available genes, to investigate insights into the discriminatory potential of the complete gene set.

Differential gene expression

The differential gene expression analysis was conducted in several steps. Firstly, a design matrix was defined to model the relationship between different cell types, where the formula ~ 0 + cell_type was used to exclude the intercept term. Subsequently, a contrast matrix was created to specifically compare memory and naive cell types. The DGEList function was employed to create a count matrix from the pre-processed data, and normalization factors were calculated using calcNormFactors. Dispersion estimation was performed using estimateDisp with a specified design matrix, considering robust methods. Following these steps, a generalized linear model was fitted using glmQLFit, and hypothesis testing was executed using glmQLFTest with the defined contrast matrix. The results were further filtered to identify significantly differentially expressed genes based on false discovery rate (FDR) and log fold change thresholds. This comprehensive methodology allowed for a thorough exploration of gene expression variations between memory and naive cell types.

Note: One of the student in out grooup delved into the application of neural networks for this analysis. However, as the corresponding code is not yet available, and I am relatively inexperienced in this particular area, the details of this exploration are not included in the current report. I look forward to reading the code once provided and expanding my knowledge on this topic in the future.

Code Availability

The corresponding codes for each step, is provided below:

#loading packages 
library(Seurat)

## Loading required package: SeuratObject

## Loading required package: sp

## 'SeuratObject' was built under R 4.3.0 but the current version is
## 4.3.2; it is recomended that you reinstall 'SeuratObject' as the ABI
## for R may have changed

## 
## Attaching package: 'SeuratObject'

## The following object is masked from 'package:base':
## 
##     intersect

library(edgeR)

## Loading required package: limma

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(DESeq2)

## Loading required package: S4Vectors

## Loading required package: stats4

## Loading required package: BiocGenerics

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union

## The following object is masked from 'package:limma':
## 
##     plotMA

## The following object is masked from 'package:SeuratObject':
## 
##     intersect

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which.max, which.min

## 
## Attaching package: 'S4Vectors'

## The following objects are masked from 'package:dplyr':
## 
##     first, rename

## The following object is masked from 'package:utils':
## 
##     findMatches

## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname

## Loading required package: IRanges

## 
## Attaching package: 'IRanges'

## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, slice

## The following object is masked from 'package:sp':
## 
##     %over%

## Loading required package: GenomicRanges

## Loading required package: GenomeInfoDb

## Loading required package: SummarizedExperiment

## Loading required package: MatrixGenerics

## Loading required package: matrixStats

## 
## Attaching package: 'matrixStats'

## The following object is masked from 'package:dplyr':
## 
##     count

## 
## Attaching package: 'MatrixGenerics'

## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## 
## Attaching package: 'Biobase'

## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians

## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians

## 
## Attaching package: 'SummarizedExperiment'

## The following object is masked from 'package:Seurat':
## 
##     Assays

## The following object is masked from 'package:SeuratObject':
## 
##     Assays

library(pheatmap)
library(RColorBrewer)
library(ggplot2)
library(ggrepel)
library(mixOmics)

## Loading required package: MASS

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

## Loading required package: lattice

## 
## Loaded mixOmics 6.26.0
## Thank you for using mixOmics!
## Tutorials: http://mixomics.org
## Bookdown vignette: https://mixomicsteam.github.io/Bookdown
## Questions, issues: Follow the prompts at http://mixomics.org/contact-us
## Cite us:  citation('mixOmics')

#load data
setwd("~/Documents/PhD-Bioinformatics-EpiCass/PhDCourses/StatCourse/Day3/GroupWork/filtered_gene_bc_matrices/")
dt <- Read10X(data.dir = "./hg19/")

pbmc <- CreateSeuratObject(counts = dt, project = "pbmc3k", min.cells = 3, min.features = 200)

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

# Clustering
pbmc <- NormalizeData(pbmc)

## Normalizing layer: counts

pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

## Finding variable features for layer counts

pbmc <- ScaleData(pbmc)

## Centering and scaling data matrix

pbmc <- RunPCA(pbmc, verbose = FALSE)

pbmc <- FindNeighbors(pbmc, dims = 1:10)

## Computing nearest neighbor graph

## Computing SNN

pbmc <- FindClusters(pbmc, resolution = 0.5)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2700
## Number of edges: 97892
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8719
## Number of communities: 9
## Elapsed time: 0 seconds

pbmc <- RunUMAP(pbmc, dims = 1:10)

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

## 12:44:08 UMAP embedding parameters a = 0.9922 b = 1.112

## Found more than one class "dist" in cache; using the first, from namespace 'spam'

## Also defined by 'BiocGenerics'

## 12:44:08 Read 2700 rows and found 10 numeric columns

## 12:44:08 Using Annoy for neighbor search, n_neighbors = 30

## Found more than one class "dist" in cache; using the first, from namespace 'spam'

## Also defined by 'BiocGenerics'

## 12:44:08 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 12:44:08 Writing NN index file to temp file /var/folders/kv/4c74z_qd7tv9b_nw62zdzczh0000gn/T//Rtmp7JVqBL/file17a13dd3d5cb
## 12:44:08 Searching Annoy index using 1 thread, search_k = 3000
## 12:44:09 Annoy recall = 100%
## 12:44:09 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 12:44:10 Initializing from normalized Laplacian + noise (using RSpectra)
## 12:44:10 Commencing optimization for 500 epochs, with 107868 positive edges
## 12:44:15 Optimization finished

# Visualize clusters

DimPlot(pbmc, reduction = "umap")

#naming the clusters
new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono",
                     "NK", "DC", "Platelet")

names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

pbmc$cell_type <- pbmc@active.ident

cd4 <- subset(pbmc, subset = cell_type %in% c("Naive CD4 T", "Memory CD4 T"))

DimPlot(cd4, reduction = "umap")

#subset
cd4_n <- subset(cd4, subset = cell_type == "Naive CD4 T")@assays$RNA$counts

cd4_m <- subset(cd4, subset = cell_type == "Memory CD4 T")@assays$RNA$counts

####dataframe
cd4_n_df <- as.data.frame(as.matrix(cd4_n))
cd4_m_df <- as.data.frame(as.matrix(cd4_m))

#combind 
combind.df.n.m <- cbind(cd4_n_df,cd4_m_df)

#remove non zero 
combined.df.no.zeros <- combind.df.n.m %>%
  filter(rowSums(.) != 0)

#####
meanLog2CPM <- rowMeans(log2(cpm(combined.df.no.zeros) + 1)) 

combined.df.no.zeros <- combined.df.no.zeros[meanLog2CPM > 0.1, ]

ggplot()+
  geom_histogram(aes(meanLog2CPM ), binwidth = .01) +
  geom_vline(xintercept = .1, color = "red")

#### Get samples tag for the dataframes
sample_n <- data.frame(cell = colnames(cd4_n), cell_type = "naive") 
sample_m <- data.frame(cell = colnames(cd4_m), cell_type = "memory")
samples <- rbind(sample_n, sample_m)

dds <- DESeqDataSetFromMatrix(as.matrix(combined.df.no.zeros),
                              design = ~ 0 + cell_type ,
                              colData = samples)

## converting counts to integer mode

## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors

#Normalize
normCounts <- varianceStabilizingTransformation(dds, blind = TRUE)

## -- note: fitType='parametric', but the dispersion trend was not well captured by the
##    function: y = a/x + b, and a local regression fit was automatically substituted.
##    specify fitType='local' or 'mean' to avoid this message next time.

#Heatmap plot
sampleDist <- cor(assay(normCounts), method = "spearman") 
sampleColor <- brewer.pal(3, "Accent")[1:2] 
names(sampleColor) <- levels(as.factor(samples$cell_type))

pheatmap(sampleDist,
         clustering_distance_rows = as.dist(1 - sampleDist), 
         clustering_distance_cols = as.dist(1 - sampleDist), 
         annotation_col = data.frame(CellTypes = samples$cell_type,
                                     row.names = samples$cell),
         annotation_colors = list(CellTypes = sampleColor),
         show_colnames = FALSE,
         show_rownames = FALSE
         )

##PCA plot
pcaRes <- prcomp(t(assay(normCounts)))
varExp <- round(pcaRes$sdev^2 / sum(pcaRes$sdev^2) * 100) 
pcaDF <- data.frame(
  PC1 = pcaRes$x[, 1],
  PC2 = pcaRes$x[, 2],
  Storage = samples$cell_type,
  Sample = samples$cell)
pcaPlot <- ggplot( 
  data = pcaDF,
  mapping = aes(x = PC1, y = PC2, color = Storage, label = Sample)) + 
  geom_point(size = 3) +
  geom_text_repel(size = 4) +
  labs(x = paste0("PC1 (", varExp[1], " %)"), y = paste0("PC2 (", varExp[2], " %)")) + 
  theme_minimal() +
  theme(axis.text = element_text(size = 12), legend.text = element_text(size = 10)) +
  scale_color_manual(values = brewer.pal(3, "Accent")) 
print(pcaPlot)

## Warning: ggrepel: 1151 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

X <- t(combined.df.no.zeros)
Y <- samples$cell_type

# Fit sPLS-DA model

splsda.result <- splsda(X, Y, keepX = c(50,50), ncomp = 2) 
plotIndiv(splsda.result, ind.names = FALSE, legend = TRUE,) # plot the samples

#plotVar(splsda.result)

auroc(splsda.result, roc.comp = 1, print = FALSE) # AUROC for the first component

auroc(splsda.result, roc.comp = 2, print = F)  # AUROC for two components

# depict weight assigned to each of these variables

plotLoadings(splsda.result, comp =1, method = 'mean', contrib = 'max')

plotLoadings(splsda.result, comp =2, method = 'mean', contrib = 'max')

# Running it again with 500 variables in each component. The default is two components.

splsda.result <- splsda(X, Y, keepX = c(500, 500)) # run the method
plotIndiv(splsda.result, ind.names = FALSE) # plot the samples

auroc(splsda.result, roc.comp = 1, print = FALSE) # AUROC for the first component

auroc(splsda.result, roc.comp = 2, print = F)  # AUROC for two components

# depict weight assigned to each of these variables
plotLoadings(splsda.result, method = 'mean', contrib = 'max')

plotLoadings(splsda.result, comp =2, method = 'mean', contrib = 'max')

# Running the model with all the genes assigned to both components

splsda.result <- splsda(X, Y) # run the method
plotIndiv(splsda.result, ind.names = FALSE) # plot the samples

plotVar(splsda.result)

auroc(splsda.result, roc.comp = 1, print = FALSE) # AUROC for the first component

auroc(splsda.result, roc.comp = 2, print = F)  # AUROC for two components

# depict weight assigned to each of these variables
plotLoadings(splsda.result, method = 'mean', contrib = 'max')

plotLoadings(splsda.result, comp =2, method = 'mean', contrib = 'max')

#Differential gene expression analysis
#Step 1: Define design matrix
designMatrix <- model.matrix(~ 0 + cell_type, 
                             data = samples) 

#Step 2: Define contrast matrix
contrastMatrix <- makeContrasts(cell_typememory - cell_typenaive, levels = designMatrix)


#Step 3: Fit model
dge <- DGEList(combined.df.no.zeros)
dge <- calcNormFactors(dge)
dge <- estimateDisp(dge, designMatrix, robust = TRUE) 
fit <- glmQLFit(dge, designMatrix, robust = TRUE)

#Step 4: Perform hypothesis testing
res <- glmQLFTest(fit, contrast = contrastMatrix)
res <- topTags(res, n = nrow(combined.df.no.zeros))
sigRes <- subset(res$table, FDR < 0.05 & abs(logFC) > 1)

#Visualize results
volcanoPlot <- ggplot(res$table, 
                      aes(x = logFC, y = -log10(FDR),
                          color = ifelse(FDR < 0.05 & abs(logFC) > 1, "darkred", "grey"))) + 
  geom_point() +
  xlab(expression("Fold Change, Log"[2]*"")) +
  ylab(expression("Adjusted P value, Log"[10]*"")) +
  geom_vline(xintercept = c(-1, 1), linetype = "dotted", linewidth = 1) + 
  geom_hline(yintercept = -log10(0.05), linetype = "dotted", linewidth = 1) + 
  theme_minimal() +
  theme(legend.position = "none") +
  scale_colour_manual(values = c("darkred", "grey", "steelblue")) + 
  geom_text_repel(aes(x = logFC, y = -log10(FDR), label = rownames(res$table)[1:10],
                      size = 2, color = "steelblue"),
                  data = res$table[1:10, ])

print(volcanoPlot)

Conclusion

Our exploration of the scRNA-seq dataset as shown, expected similarity between naive and memory T cells in these immune cells. The implementation of various clustering methods, unsupervised methods including heatmaps and Principal Component Analysis (PCA), exhibited challenges in the initial separation of these cell types. Notably, it was the supervised clustering approach, sPLS-DA, that demonstrated superior discriminatory power, successfully capturing the similar distinctions between naive and memory T cells. Our study however, didn’t consider the sparsity of the dataset. We recommend exploring zero-inflated models and other advanced statistical methods tailored for handling sparse datasets. Techniques such as zero-inflated negative binomial models can better accommodate the inherent sparsity and zero-inflation characteristic of scRNA-seq data. Furthermore, we identification of 10 differentially expressed genes (DEGs) that displayed statistical significance, as depicted in the volcano plot. These DEGs, exhibiting notable variations in expression levels, hold potential key regulators in distinguishing between naive and memory T cell populations.

Reflections

While these techniques were previously unfamiliar to me, my background has predominantly involved genome assemblies and comparative genomics. Nevertheless, understanding these statistical methods is crucial for the second phase of my PhD and holds future relevance, given that statistical analysis is an integral component of my bioinformatics career. This knowledge becomes particularly relevant as I transition into second objective in handling methylation data, where statistical expertise is essential for drawing meaningful conclusions regarding significant methylation calls in my research area.

Individual report - Stats course 2023

Michael Landi

29th Nov 2023