Goal

Our goal is to analysis a scRNA sequencing data using Seurat

#Load the Required Libraries

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(Seurat)

## The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
## which was just loaded, will retire in October 2023.
## Please refer to R-spatial evolution reports for details, especially
## https://r-spatial.org/r/2023/05/15/evolution4.html.
## It may be desirable to make the sf package available;
## package maintainers should consider adding sf to Suggests:.
## The sp package is now running under evolution status 2
##      (status 2 uses the sf package in place of rgdal)

## Attaching SeuratObject

library(patchwork)

Extracting count matrices to eventually create Seurat Objects.

The Surat object serves as a container that contains both data (like the count matrix) and analysis (like PCA, or clustering results) for a single-cell dataset. For a technical discussion of the Seurat object structure.

pbmc.data <- Read10X(data.dir = "data/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc

## An object of class Seurat 
## 13714 features across 2700 samples within 1 assay 
## Active assay: RNA (13714 features, 0 variable features)

QC and selecting cells for further analysis

We are first going to check for mitochrondrial DNA contamination. nFeature_RNA is the number of genes detected in each cell nCount_RNA is the total number of molecules detected within a cell. The number of unique genes detected in each cell. Low-quality cells or empty droplets will often have very few genes Cell doublets or multiplets may exhibit an aberrantly high gene count Similarly, the total number of molecules detected within a cell (correlates strongly with unique genes)

pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

Filtering out cells to use based on QC

filtering the cells that have unique feature counts over 2500 or less than 200 filtering cells that have >5% mitochondrial counts

pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

verification for the selected features

VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

#Normalizing the data After removing unwanted cells from the dataset, the next step is to normalize the data. By default, we employ a global-scaling normalization method “LogNormalize” that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

#For clarity, in this previous line of code (and in future commands), the default values for certain parameters in the function call. However, this isn’t required and the same behavior can be achieved with : 

pbmc <- NormalizeData(pbmc)

#view seurat object

pbmc

## An object of class Seurat 
## 13714 features across 2638 samples within 1 assay 
## Active assay: RNA (13714 features, 0 variable features)

#Identification of highly variable features Newt we calculate a subset of features that exhibit cell to cell variantion in the data set. Which helps in the downstream analysis of biological signal in single cell data sets

pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)

# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)

## When using repel, set xnudge and ynudge to 0 for optimal results

plot1 + plot2

#Scaling and linear dimensionality reduction Next linear transformation of the data that is a standard pre-processing step prior to dimensional reduction techinques like PCA. Shifts the expression of each gene, so that the mean expression across cells is 0 Scales the expression of each gene, so that the variance across cells is 1 This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate The results of this are stored seurat object

all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)

## Centering and scaling data matrix

pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))

## PC_ 1 
## Positive:  CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP 
##     FCER1G, CFD, LGALS1, S100A8, CTSS, LGALS2, SERPINA1, IFITM3, SPI1, CFP 
##     PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD 
## Negative:  MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, STK17A, CTSW 
##     CD247, GIMAP5, AQP3, CCL5, SELL, TRAF3IP3, GZMA, MAL, CST7, ITM2A 
##     MYC, GIMAP7, HOPX, BEX2, LDLRAP1, GZMK, ETS1, ZAP70, TNFAIP8, RIC3 
## PC_ 2 
## Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74 
##     HLA-DMA, HLA-DPB1, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB 
##     BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74 
## Negative:  NKG7, PRF1, CST7, GZMB, GZMA, FGFBP2, CTSW, GNLY, B2M, SPON2 
##     CCL4, GZMH, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX 
##     TTC38, APMAP, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, RHOC 
## PC_ 3 
## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, HLA-DPA1, CD74, MS4A1, HLA-DRB1, HLA-DRA 
##     HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, CD37, HVCN1, FCRLA, IRF8 
##     PLAC8, BLNK, MALAT1, SMIM14, PLD4, P2RX5, IGLL5, LAT2, SWAP70, FCGR2B 
## Negative:  PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU 
##     HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1 
##     NGFRAP1, F13A1, SEPT5, RUFY1, TSC22D1, MPP1, CMTM5, RP11-367G6.3, MYL9, GP1BA 
## PC_ 4 
## Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1, CD74, HIST1H2AC, HLA-DPB1, PF4, SDPR 
##     TCL1A, HLA-DRB1, HLA-DPA1, HLA-DQA2, PPBP, HLA-DRA, LINC00926, GNG11, SPARC, HLA-DRB5 
##     GP9, AP001189.4, CA2, PTCRA, CD9, NRGN, RGS18, CLU, TUBB1, GZMB 
## Negative:  VIM, IL7R, S100A6, IL32, S100A8, S100A4, GIMAP7, S100A10, S100A9, MAL 
##     AQP3, CD2, CD14, FYB, LGALS2, GIMAP4, ANXA1, CD27, FCN1, RBP7 
##     LYZ, S100A11, GIMAP5, MS4A6A, S100A12, FOLR3, TRABD2A, AIF1, IL8, IFI6 
## PC_ 5 
## Positive:  GZMB, NKG7, S100A8, FGFBP2, GNLY, CCL4, CST7, PRF1, GZMA, SPON2 
##     GZMH, S100A9, LGALS2, CCL3, CTSW, XCL2, CD14, CLIC3, S100A12, RBP7 
##     CCL5, MS4A6A, GSTP1, FOLR3, IGFBP7, TYROBP, TTC38, AKR1C3, XCL1, HOPX 
## Negative:  LTB, IL7R, CKB, VIM, MS4A7, AQP3, CYTIP, RP11-290F20.3, SIGLEC10, HMOX1 
##     LILRB2, PTGES3, MAL, CD27, HN1, CD2, GDI2, CORO1B, ANXA5, TUBA1B 
##     FAM110A, ATP1A1, TRADD, PPA1, CCDC109B, ABRACL, CTD-2006K23.1, WARS, VMO1, FYB

#PCA/UMAP/t-SNE To overcome the extensive technical noise in any single feature for scRNA-seq data, Seurat clusters cells based on their PCA scores, with each PC essentially representing a ‘metafeature’ that combines information across a correlated feature set. The top principal components therefore represent a robust compression of the dataset.

pbmc <- FindNeighbors(pbmc, dims = 1:10)

## Computing nearest neighbor graph

## Computing SNN

pbmc <- FindClusters(pbmc, resolution = 0.5)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2638
## Number of edges: 95927
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8728
## Number of communities: 9
## Elapsed time: 0 seconds

# Look at cluster IDs of the first 5 cells
head(Idents(pbmc), 5)

## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 
##                2                3                2                1 
## AAACCGTGTATGCG-1 
##                6 
## Levels: 0 1 2 3 4 5 6 7 8

pbmc <- RunUMAP(pbmc, dims = 1:10)

## 10:10:57 UMAP embedding parameters a = 0.9922 b = 1.112

## 10:10:57 Read 2638 rows and found 10 numeric columns

## 10:10:57 Using Annoy for neighbor search, n_neighbors = 30

## 10:10:57 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 10:10:58 Writing NN index file to temp file /tmp/RtmpXSQ3bi/file212074721a95
## 10:10:58 Searching Annoy index using 1 thread, search_k = 3000
## 10:10:58 Annoy recall = 100%
## 10:10:59 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 10:10:59 Initializing from normalized Laplacian + noise (using irlba)
## 10:10:59 Commencing optimization for 500 epochs, with 105140 positive edges
## 10:11:02 Optimization finished

# individual clusters
DimPlot(pbmc, reduction = "umap")

Finding differentially expressed features

Seurat can help you find markers that define clusters via differential expression. By default, it identifies positive and negative markers of a single cluster (specified in ident.1), compared to all other cells. FindAllMarkers() automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells.

cluster2.markers <- FindMarkers(pbmc, ident.1 = 2, min.pct = 0.25)
head(cluster2.markers, n = 5)

##             p_val avg_log2FC pct.1 pct.2    p_val_adj
## IL32 2.892340e-90  1.2013522 0.947 0.465 3.966555e-86
## LTB  1.060121e-86  1.2695776 0.981 0.643 1.453850e-82
## CD3D 8.794641e-71  0.9389621 0.922 0.432 1.206097e-66
## IL7R 3.516098e-68  1.1873213 0.750 0.326 4.821977e-64
## LDHB 1.642480e-67  0.8969774 0.954 0.614 2.252497e-63

write.csv(cluster2.markers,"markers.csv",row.names = F)

#Feature plots to isolate cell type To cluster the cell type, we apply molecularity optimization techniques such as louvain algorithm or SLM, to interactively group cells together, with the goal of optimizing the standard modularity function.

FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))

dot plot

cd_genes=c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A")
DotPlot(object = pbmc,features = cd_genes)

#RenameIdents to change clusters -> cell types Renaming the default cluster number into the cell type names.

new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

## Verification of cluster

cd_genes=c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A")
DotPlot(object = pbmc,features = cd_genes)

pbmc.FC <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)

## Calculating cluster Naive CD4 T

## Calculating cluster CD14+ Mono

## Calculating cluster Memory CD4 T

## Calculating cluster B

## Calculating cluster CD8 T

## Calculating cluster FCGR3A+ Mono

## Calculating cluster NK

## Calculating cluster DC

## Calculating cluster Platelet

pbmc.FC %>%
    group_by(cluster) %>%
    slice_max(n = 2, order_by = avg_log2FC)

## # A tibble: 18 × 7
## # Groups:   cluster [9]
##        p_val avg_log2FC pct.1 pct.2 p_val_adj cluster      gene    
##        <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>        <chr>   
##  1 9.57e- 88       1.36 0.447 0.108 1.31e- 83 Naive CD4 T  CCR7    
##  2 3.75e-112       1.09 0.912 0.592 5.14e-108 Naive CD4 T  LDHB    
##  3 0               5.57 0.996 0.215 0         CD14+ Mono   S100A9  
##  4 0               5.48 0.975 0.121 0         CD14+ Mono   S100A8  
##  5 1.06e- 86       1.27 0.981 0.643 1.45e- 82 Memory CD4 T LTB     
##  6 2.97e- 58       1.23 0.42  0.111 4.07e- 54 Memory CD4 T AQP3    
##  7 0               4.31 0.936 0.041 0         B            CD79A   
##  8 9.48e-271       3.59 0.622 0.022 1.30e-266 B            TCL1A   
##  9 5.61e-202       3.10 0.983 0.234 7.70e-198 CD8 T        CCL5    
## 10 7.25e-165       3.00 0.577 0.055 9.95e-161 CD8 T        GZMK    
## 11 3.51e-184       3.31 0.975 0.134 4.82e-180 FCGR3A+ Mono FCGR3A  
## 12 2.03e-125       3.09 1     0.315 2.78e-121 FCGR3A+ Mono LST1    
## 13 3.13e-191       5.32 0.961 0.131 4.30e-187 NK           GNLY    
## 14 7.95e-269       4.83 0.961 0.068 1.09e-264 NK           GZMB    
## 15 1.48e-220       3.87 0.812 0.011 2.03e-216 DC           FCER1A  
## 16 1.67e- 21       2.87 1     0.513 2.28e- 17 DC           HLA-DPB1
## 17 1.92e-102       8.59 1     0.024 2.63e- 98 Platelet     PPBP    
## 18 9.25e-186       7.29 1     0.011 1.27e-181 Platelet     PF4

write.csv(pbmc.FC,"foldchange.csv",row.names = F)

##Export the RDS object

saveRDS(pbmc, file ="PBMC.rds")

basic protocol for RNA analysis using seurat

2023-10-10