This is an R-markdown document for PCL368 - Introduction to bioinformatics (part 2).
Let’s start by loading all the packages we will need for this session - revisit part 1 for installation code.
library(Seurat) # for most of our work with RNAseq
library(org.Mm.eg.db) # gene ontology database for mouse
library(clusterProfiler) # gene set enrichment analysis package
#library(DESeq2) # differential gene-expression test package; may not need to be loaded, but definitely need to be installed to use its test
We ended last time after creating our Seurat object.
expt_metadata <- read.csv("./Mouse_data/combined_metadata.csv", row.names=1) # our metadata (needed for published Rmd)
row.names(expt_metadata) <- expt_metadata$Sample # set sample names to row names (needed for published Rmd)
expt_matrix <- read.csv("./Mouse_data/combined_count_matrix.csv", row.names=1) # our count matrix (needed for published Rmd)
row.names(expt_matrix) <- expt_matrix$Gene_name # set gene names to row names (needed for published Rmd)
expt_matrix <- as.matrix(expt_matrix[,-1]) # remove the first column of gene names, to leave us with a pure matrix (needed for published Rmd)
Seurat_object <- CreateSeuratObject(count = expt_matrix,
meta.data = expt_metadata) # create our Seurat object (needed for published Rmd)
Seurat_object # see the number of genes (features) and cells (samples) in our object - it matches the corresponding dimensions of our initial data
## An object of class Seurat
## 19731 features across 12 samples within 1 assay
## Active assay: RNA (19731 features, 0 variable features)
Recall that both our metadata and our count matrix information, matched by sample names, are stored in the Seurat object.
Seurat_object@meta.data[,1:6] # see our metadata
## orig.ident nCount_RNA nFeature_RNA Sample Treatment Age
## VA1 SeuratProject 8918135 14784 VA1 Vehicle Adult
## VA2 SeuratProject 8706077 14954 VA2 Vehicle Adult
## VA3 SeuratProject 8030799 14842 VA3 Vehicle Adult
## DA1 SeuratProject 6006648 14605 DA1 20mgkg_at_E65to95 Adult
## DA2 SeuratProject 7418558 14756 DA2 20mgkg_at_E65to95 Adult
## DA3 SeuratProject 6997251 14792 DA3 20mgkg_at_E65to95 Adult
## VE1 SeuratProject 24977800 16900 VE1 Vehicle Embryonic
## VE2 SeuratProject 22332557 16900 VE2 Vehicle Embryonic
## DE1 SeuratProject 23473360 17045 DE1 20mgkg_at_E65to95 Embryonic
## DE2 SeuratProject 22869158 16958 DE2 20mgkg_at_E65to95 Embryonic
## DE3 SeuratProject 21601847 16980 DE3 20mgkg_at_E65to95 Embryonic
## DEI SeuratProject 23721425 17248 DEI Vehicle Embryonic
Seurat_object@assays$RNA@counts[1:5, ] # see the first 5 rows and columns of our count matrix
## 5 x 12 sparse Matrix of class "dgCMatrix"
## [[ suppressing 12 column names 'VA1', 'VA2', 'VA3' ... ]]
##
## 0610009B22Rik 195 195 218 166 201 191 325 262 300 288 253 170
## 0610009E02Rik 10 7 5 1 2 4 137 20 9 6 14 75
## 0610009L18Rik 149 130 175 99 159 120 41 41 51 41 32 43
## 0610010F05Rik 123 117 106 107 98 120 499 512 508 594 512 445
## 0610010K14Rik 2 . 3 . 2 . 1128 882 941 944 1064 1216
The first thing people often do with RNAseq data is normalization. We need to do this too. This is important to consider as, for example, we do not want to confound increased gene expression because of biological reasons with increased RNA amplification due to technical variation.
Seurat_object <- NormalizeData(Seurat_object,
normalization.method = "LogNormalize",
scale.factor = 1000000) # we use a scale factor of 1 million to convert our values to counts-per-million
Seurat_object@assays$RNA@data[1:10,1:5] # our normalized data is stored here; preview/see our normalized data
## 10 x 5 sparse Matrix of class "dgCMatrix"
## VA1 VA2 VA3 DA1 DA2
## 0610009B22Rik 3.1296318 3.1526569 3.3373872 3.3546663 3.3355637
## 0610009E02Rik 0.7520341 0.5900264 0.4840317 0.1539926 0.2386973
## 0610009L18Rik 2.8739899 2.7683359 3.1263704 2.8611568 3.1105213
## 0610010F05Rik 2.6940947 2.6699252 2.6531846 2.9345798 2.6539540
## 0610010K14Rik 0.2023383 . 0.3174072 . 0.2386973
## 0610012D04Rik 0.5144907 0.2067902 0.1173568 . .
## 0610025J13Rik 0.7520341 0.7648272 0.8627746 1.1519220 1.1495441
## 0610030E20Rik 2.6152630 2.8312030 3.0167702 2.8321711 2.8115968
## 0610039K10Rik 0.5144907 0.2960868 0.1173568 0.1539926 0.1264539
## 0610040B10Rik 1.1047134 1.3828935 1.4729436 1.6413348 1.3430577
Seurat_object@assays$RNA@counts[1:10,1:5] # our counts are still present in our Seurat object, which we can see here
## 10 x 5 sparse Matrix of class "dgCMatrix"
## VA1 VA2 VA3 DA1 DA2
## 0610009B22Rik 195 195 218 166 201
## 0610009E02Rik 10 7 5 1 2
## 0610009L18Rik 149 130 175 99 159
## 0610010F05Rik 123 117 106 107 98
## 0610010K14Rik 2 . 3 . 2
## 0610012D04Rik 6 2 1 . .
## 0610025J13Rik 10 10 11 13 16
## 0610030E20Rik 113 139 156 96 116
## 0610039K10Rik 6 3 1 1 1
## 0610040B10Rik 18 26 27 25 21
This code/function/action converts our raw count values to counts-per-million (cpm). To demonstrate with mock examples, gene A in sample 1 might have 30 reads (RNA fragments) aligned to it. If sample 1 has 1 million reads detected for it in total, then it would also have 30 cpm for gene A. Gene A in sample 2 might have 60 reads raw, but if 3 million reads were detected for sample 2 in total, then sample 2 would have 20 cpm for gene A. So sample 2 might have greater reads than sample 1 for gene A, but not if we normalize it.
In many Seurat functions, the output is often saved into a proper part of the Seurat object. In this case, our normalized data is saved under “data” (under “@assays” and “$RNA”), separate from “counts”, thus allowing us to keep both. This is in contrast to the often-advised-against action of overwriting an object you are working with (which could be hard to reverse).
Note: our initial count matrix for the embryonic data did not consist of integers, indicating that the data may have already been normalized. You have been provided a rounded-data matrix (along with one imputed embryonic sample). What we are doing might thus be double-normalizing. This is ok for demonstration purposes, but will need to be addressed in real-life research situations.
We can now look for differentially expressed genes (DEG’s) between treated and control samples. However, if we do so with both adult and embryonic samples at once, we will likely get confusing or misleading results. Thus we can subset the data first.
Test_object <- subset(Seurat_object, subset = Age == "Adult") # subset for adult samples
Test_object$Age # see the age metadata for our samples
## VA1 VA2 VA3 DA1 DA2 DA3
## "Adult" "Adult" "Adult" "Adult" "Adult" "Adult"
Now we can move on to differential gene expression testing.
Idents(Test_object) <- "Treatment" # we set "Treatment" as the active identity via "Idents()", this lets Seurat know we're using the treatment information stored in the "Treatment" variable to define our comparison groups
results <- FindMarkers(Test_object, # our data
ident.1 = "20mgkg_at_E65to95", # what group are we looking for DEG's for
logfc.threshold = 1.5, # what fold-change difference minimum do we want to see between control and treated for us to consider testing it
min.pct = .25, # what percent of "20mgkg_at_E65to95" samples do we want to express a gene in order for us to consider testing it
only.pos = F, # do we want only genes that are enriched/greater in "ident.1"
test.use = "DESeq2") # statistical/testing method we want to use
results[1:5,] # preview our results
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## Acta1 7.781060e-34 1.2175534 1 1 1.535281e-29
## Vgll2 6.285769e-23 1.8801957 1 1 1.240245e-18
## Nppb 1.467546e-15 1.5907584 1 1 2.895615e-11
## Abra 8.596933e-15 0.5227545 1 1 1.696261e-10
## Irx2 3.118343e-13 -1.8999591 1 1 6.152802e-09
And we’re done with Seurat! In a real analysis, we would adjust the parameters of FindMarkers based on what kind of DEG’s we’re looking for, what context we’re studying, and more.
We can how take a look at the genes that turn up in our result.
results <- tibble::rownames_to_column(results) # have gene names as a formal column
sig_results <- results[results$p_val < 0.05,] # filter for "significant" results
pos_results <- sig_results[sig_results$avg_log2FC > 0,] # filter for up-regulated genes
neg_results <- sig_results[sig_results$avg_log2FC < 0,] # filter for down-regulated genes
pos_results[1:5,] # preview some enriched genes (in treated, vs control)
## rowname p_val avg_log2FC pct.1 pct.2 p_val_adj
## 1 Acta1 7.781060e-34 1.2175534 1 1 1.535281e-29
## 2 Vgll2 6.285769e-23 1.8801957 1 1 1.240245e-18
## 3 Nppb 1.467546e-15 1.5907584 1 1 2.895615e-11
## 4 Abra 8.596933e-15 0.5227545 1 1 1.696261e-10
## 7 Thbs4 1.517718e-12 0.6566105 1 1 2.994610e-08
We can also visualize some of the gene expression values with plots in Seurat.
VlnPlot(Test_object, features = c("Cyp2b10", "Acta1"))
Lastly, we would want to try to get some understanding of what our results mean in aggregate. One general way to do so is via gene set enrichment analysis.
Gene set enrichment analyses essentially estimate the probability and significance that your list of genes overlaps with a list of genes related to function, disease, etc. For our purposes, we will be using the package “clusterProfiler”.
Ref: Wu et al., 2021 : https://pubmed.ncbi.nlm.nih.gov/34557778/
geneList <- pos_results$rowname # get a vector of gene names
# we need to convert our gene names of interest to entrez IDs for use in clusterProfiler
convertedList <- AnnotationDbi::select(org.Mm.eg.db, # reference database
keys = geneList, # the genes we want to convert [1]
columns = c("ENTREZID", "SYMBOL"), # the data we use to convert
keytype = "SYMBOL") # what the format is we have in [1]
geneList <- convertedList$ENTREZID # get vector of entrez IDs
geneList <- geneList[!is.na(geneList)] # renove NA's from list
ego <- enrichGO(gene = geneList, # our entrez IDs of interest
OrgDb = org.Mm.eg.db, # reference list
ont = "MF", # what type of ontology we want (Biological Process, Cellular Component, Molecular Function)
pAdjustMethod = "BH", # multiple-comparisons method
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,
readable = TRUE) # main function for geneset enrichment
result_table <- ego@result
result_table[1:5,1:3] # preview our results
## ID Description GeneRatio
## GO:0005201 GO:0005201 extracellular matrix structural constituent 19/330
## GO:0008201 GO:0008201 heparin binding 13/330
## GO:0005539 GO:0005539 glycosaminoglycan binding 15/330
## GO:1901681 GO:1901681 sulfur compound binding 16/330
## GO:0050840 GO:0050840 extracellular matrix binding 8/330
And that’s it! We can save our results in a few different ways. CSV is a common format for saving tables.
write.csv(results, "results.csv")
RDS is a common format for any R object (but this would be specific to R).
saveRDS(Seurat_object, "Seurat_object.rds")