This is an R-markdown document for PCL1491 - Introduction to bioinformatics (part 2).
Let’s start by loading all the packages we will need for this session.
library(Seurat) # for most of our work with RNAseq
# install.packages("BiocManager")
# BiocManager::install("org.Mm.eg.db") # as needed
library(org.Mm.eg.db) # gene ontology database for mouse
# BiocManager::install("clusterProfiler") # as needed
library(clusterProfiler) # gene set enrichment analysis package
#BiocManager::install("DESeq2") # as needed
# library(DESeq2) # differential gene-expression test package; may not need to be loaded, but definitely need to be installed to use its test
We ended last time after creating our Seurat object.
Seurat_object <- readRDS("./Seurat_object.rds") # as appropriate
Seurat_object # see the number of genes (features) and cells (samples) in our object - it matches the corresponding dimensions of our initial data
## An object of class Seurat
## 19731 features across 12 samples within 1 assay
## Active assay: RNA (19731 features, 0 variable features)
Recall that both our metadata and our count matrix information, matched by sample names, are stored in the Seurat object.
The first thing people often do with RNAseq data is normalization. We need to do this too. This is important to consider as, for example, we do not want to confound increased gene expression because of biological reasons with increased RNA amplification due to technical variation.
Seurat_object <- NormalizeData(Seurat_object,
normalization.method = "LogNormalize",
scale.factor = 1000000) # we use a scale factor of 1 million to convert our values to counts-per-million
Seurat_object@assays$RNA@data[1:10,1:5] # our normalized data is stored here; preview/see our normalized data
## 10 x 5 sparse Matrix of class "dgCMatrix"
## VA1 VA2 VA3 DA1 DA2
## 0610009B22Rik 3.1296318 3.1526569 3.3373872 3.3546663 3.3355637
## 0610009E02Rik 0.7520341 0.5900264 0.4840317 0.1539926 0.2386973
## 0610009L18Rik 2.8739899 2.7683359 3.1263704 2.8611568 3.1105213
## 0610010F05Rik 2.6940947 2.6699252 2.6531846 2.9345798 2.6539540
## 0610010K14Rik 0.2023383 . 0.3174072 . 0.2386973
## 0610012D04Rik 0.5144907 0.2067902 0.1173568 . .
## 0610025J13Rik 0.7520341 0.7648272 0.8627746 1.1519220 1.1495441
## 0610030E20Rik 2.6152630 2.8312030 3.0167702 2.8321711 2.8115968
## 0610039K10Rik 0.5144907 0.2960868 0.1173568 0.1539926 0.1264539
## 0610040B10Rik 1.1047134 1.3828935 1.4729436 1.6413348 1.3430577
This code/function/action converts our raw count values to counts-per-million (cpm). To demonstrate with mock examples, gene A in sample 1 might have 30 reads (RNA fragments) aligned to it. If sample 1 has 1 million reads detected for it in total, then it would also have 30 cpm for gene A. Gene A in sample 2 might have 60 reads raw, but if 3 million reads were detected for sample 2 in total, then sample 2 would have 20 cpm for gene A. So sample 2 might have greater reads than sample 1 for gene A, but not if we normalize it.
In many Seurat functions, the output is often saved into a proper part of the Seurat object. In this case, our normalized data is saved under “data” (under “@assays” and “$RNA”), separate from “counts”, thus allowing us to keep both. This is in contrast to the often-advised-against action of overwriting an object you are working with (which could be hard to reverse).
Note: our initial count matrix for the embryonic data did not consist of integers, indicating that the data may have already been normalized. You have been provided a rounded-data matrix (along with one imputed embryonic sample). What we are doing might thus be double-normalizing. This is ok for demonstration purposes, but will need to be addressed in real-life research situations.
We can now look for differentially expressed genes (DEG’s) between treated and control samples. However, if we do so with both adult and embryonic samples at once, we will likely get confusing or misleading results. Thus we can subset the data first.
Test_object <- subset(Seurat_object, subset = Age == "Embryonic") # subset for adult samples
Now we can move on to differential gene expression testing.
Idents(Test_object) <- "Treatment" # we set "Treatment" as the active identity via "Idents()", this lets Seurat know we're using the treatment information stored in the "Treatment" variable to define our comparison groups
results <- FindMarkers(Test_object, # our data
ident.1 = "20mgkg_at_E65to95", # what group are we looking for DEG's for
logfc.threshold = 1.5, # what fold-change difference minimum do we want to see between control and treated for us to consider testing it
min.pct = .25, # what percent of "20mgkg_at_E65to95" samples do we want to express a gene in order for us to consider testing it
only.pos = F, # do we want only genes that are enriched/greater in "ident.1"
test.use = "DESeq2") # statistical/testing method we want to use
results[1:5,] # see first 5 results
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## Gm17494 4.902975e-18 -2.451796 1 1 9.674060e-14
## Gm13341 1.654806e-17 -1.575678 1 1 3.265098e-13
## Gm11407 1.067264e-14 -1.747157 1 1 2.105818e-10
## Gm14287 2.735312e-13 -6.409391 0 1 5.397043e-09
## Gm5940 8.937170e-13 -1.600335 1 1 1.763393e-08
And we’re done with Seurat! In a real analysis, we would adjust the parameters of FindMarkers based on what kind of DEG’s we’re looking for, what context we’re studying, and more.
We can how take a look at the genes that turn up in our result.
results <- tibble::rownames_to_column(results) # have gene names as a formal column
sig_results <- results[results$p_val < 0.05,] # filter for "significant" results
pos_results <- sig_results[sig_results$avg_log2FC > 0,] # filter for up-regulated genes
We can also visualize some of the gene expression values with plots in Seurat.
VlnPlot(Test_object, features = c("Rpl26", "Kpna2"))
Lastly, we would want to try to get some understanding of what our results mean in aggregate. One general way to do so is via gene set enrichment analysis.
Gene set enrichment analyses essentially estimate the probability and significance that your list of genes overlaps with a list of genes related to function, disease, etc. For our purposes, we will be using the package “clusterProfiler”.
Ref: Wu et al., 2021 : https://pubmed.ncbi.nlm.nih.gov/34557778/
geneList <- pos_results$rowname # get a vector of gene names
# we need to convert our gene names of interest to entrez IDs for use in clusterProfiler
convertedList <- AnnotationDbi::select(org.Mm.eg.db, # reference database
keys = geneList, # the genes we want to convert [1]
columns = c("ENTREZID", "SYMBOL"), # the data we use to convert
keytype = "SYMBOL") # what the format is we have in [1]
## 'select()' returned 1:1 mapping between keys and columns
geneList <- convertedList$ENTREZID # get vector of entrez IDs
geneList <- geneList[!is.na(geneList)] # renove NA's from list
ego <- enrichGO(gene = geneList, # our entrez IDs of interest
OrgDb = org.Mm.eg.db, # reference list
ont = "CC", # what type of ontology we want (Biological Process, Cellular Component, Molecular Function)
pAdjustMethod = "BH", # multiple-comparisons method
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,
readable = TRUE) # main function for geneset enrichment
result_table <- ego@result
result_table[1:5,1:4] # see first 5 results (first 4 columns)
## ID Description GeneRatio BgRatio
## GO:0032279 GO:0032279 asymmetric synapse 25/680 405/23271
## GO:0099634 GO:0099634 postsynaptic specialization membrane 12/680 136/23271
## GO:0097060 GO:0097060 synaptic membrane 27/680 470/23271
## GO:0014069 GO:0014069 postsynaptic density 24/680 400/23271
## GO:0034703 GO:0034703 cation channel complex 16/680 222/23271
And that’s it! We can save our results in a few different ways. CSV is a common format for saving tables.
write.csv(results, "./results.csv") # note you don't actually need the "./" part here