Setting up

This is an R-markdown document for PCL368 - Introduction to bioinformatics (part 2).

Let’s start by loading all the packages we will need for this session - revisit part 1 for installation code.

library(Seurat) # for most of our work with RNAseq
library(org.Mm.eg.db) # gene ontology database for mouse
library(clusterProfiler) # gene set enrichment analysis package
#library(DESeq2) # differential gene-expression test package; may not need to be loaded, but definitely need to be installed to use its test

We ended last time after creating our Seurat object.

expt_metadata <- read.csv("./Mouse_data/combined_metadata.csv", row.names=1) # our metadata (needed for published Rmd)
row.names(expt_metadata) <- expt_metadata$Sample # set sample names to row names (needed for published Rmd)
expt_matrix <- read.csv("./Mouse_data/combined_count_matrix.csv", row.names=1) # our count matrix (needed for published Rmd)
row.names(expt_matrix) <- expt_matrix$Gene_name # set gene names to row names (needed for published Rmd)
expt_matrix <- as.matrix(expt_matrix[,-1]) # remove the first column of gene names, to leave us with a pure matrix (needed for published Rmd)
Seurat_object <- CreateSeuratObject(count = expt_matrix, 
                                    meta.data = expt_metadata) # create our Seurat object (needed for published Rmd)

Seurat_object # see the number of genes (features) and cells (samples) in our object - it matches the corresponding dimensions of our initial data
## An object of class Seurat 
## 19731 features across 12 samples within 1 assay 
## Active assay: RNA (19731 features, 0 variable features)

Recall that both our metadata and our count matrix information, matched by sample names, are stored in the Seurat object.

Seurat_object@meta.data[,1:6] # see our metadata
##        orig.ident nCount_RNA nFeature_RNA Sample         Treatment       Age
## VA1 SeuratProject    8918135        14784    VA1           Vehicle     Adult
## VA2 SeuratProject    8706077        14954    VA2           Vehicle     Adult
## VA3 SeuratProject    8030799        14842    VA3           Vehicle     Adult
## DA1 SeuratProject    6006648        14605    DA1 20mgkg_at_E65to95     Adult
## DA2 SeuratProject    7418558        14756    DA2 20mgkg_at_E65to95     Adult
## DA3 SeuratProject    6997251        14792    DA3 20mgkg_at_E65to95     Adult
## VE1 SeuratProject   24977800        16900    VE1           Vehicle Embryonic
## VE2 SeuratProject   22332557        16900    VE2           Vehicle Embryonic
## DE1 SeuratProject   23473360        17045    DE1 20mgkg_at_E65to95 Embryonic
## DE2 SeuratProject   22869158        16958    DE2 20mgkg_at_E65to95 Embryonic
## DE3 SeuratProject   21601847        16980    DE3 20mgkg_at_E65to95 Embryonic
## DEI SeuratProject   23721425        17248    DEI           Vehicle Embryonic
Seurat_object@assays$RNA@counts[1:5, ] # see the first 5 rows and columns of our count matrix
## 5 x 12 sparse Matrix of class "dgCMatrix"
##    [[ suppressing 12 column names 'VA1', 'VA2', 'VA3' ... ]]
##                                                                 
## 0610009B22Rik 195 195 218 166 201 191  325 262 300 288  253  170
## 0610009E02Rik  10   7   5   1   2   4  137  20   9   6   14   75
## 0610009L18Rik 149 130 175  99 159 120   41  41  51  41   32   43
## 0610010F05Rik 123 117 106 107  98 120  499 512 508 594  512  445
## 0610010K14Rik   2   .   3   .   2   . 1128 882 941 944 1064 1216

The first thing people often do with RNAseq data is normalization. We need to do this too. This is important to consider as, for example, we do not want to confound increased gene expression because of biological reasons with increased RNA amplification due to technical variation.

Seurat_object <- NormalizeData(Seurat_object, 
                               normalization.method = "LogNormalize", 
                               scale.factor = 1000000) # we use a scale factor of 1 million to convert our values to counts-per-million

Seurat_object@assays$RNA@data[1:10,1:5] # our normalized data is stored here; preview/see our normalized data
## 10 x 5 sparse Matrix of class "dgCMatrix"
##                     VA1       VA2       VA3       DA1       DA2
## 0610009B22Rik 3.1296318 3.1526569 3.3373872 3.3546663 3.3355637
## 0610009E02Rik 0.7520341 0.5900264 0.4840317 0.1539926 0.2386973
## 0610009L18Rik 2.8739899 2.7683359 3.1263704 2.8611568 3.1105213
## 0610010F05Rik 2.6940947 2.6699252 2.6531846 2.9345798 2.6539540
## 0610010K14Rik 0.2023383 .         0.3174072 .         0.2386973
## 0610012D04Rik 0.5144907 0.2067902 0.1173568 .         .        
## 0610025J13Rik 0.7520341 0.7648272 0.8627746 1.1519220 1.1495441
## 0610030E20Rik 2.6152630 2.8312030 3.0167702 2.8321711 2.8115968
## 0610039K10Rik 0.5144907 0.2960868 0.1173568 0.1539926 0.1264539
## 0610040B10Rik 1.1047134 1.3828935 1.4729436 1.6413348 1.3430577
Seurat_object@assays$RNA@counts[1:10,1:5] # our counts are still present in our Seurat object, which we can see here
## 10 x 5 sparse Matrix of class "dgCMatrix"
##               VA1 VA2 VA3 DA1 DA2
## 0610009B22Rik 195 195 218 166 201
## 0610009E02Rik  10   7   5   1   2
## 0610009L18Rik 149 130 175  99 159
## 0610010F05Rik 123 117 106 107  98
## 0610010K14Rik   2   .   3   .   2
## 0610012D04Rik   6   2   1   .   .
## 0610025J13Rik  10  10  11  13  16
## 0610030E20Rik 113 139 156  96 116
## 0610039K10Rik   6   3   1   1   1
## 0610040B10Rik  18  26  27  25  21

This code/function/action converts our raw count values to counts-per-million (cpm). To demonstrate with mock examples, gene A in sample 1 might have 30 reads (RNA fragments) aligned to it. If sample 1 has 1 million reads detected for it in total, then it would also have 30 cpm for gene A. Gene A in sample 2 might have 60 reads raw, but if 3 million reads were detected for sample 2 in total, then sample 2 would have 20 cpm for gene A. So sample 2 might have greater reads than sample 1 for gene A, but not if we normalize it.

In many Seurat functions, the output is often saved into a proper part of the Seurat object. In this case, our normalized data is saved under “data” (under “@assays” and “$RNA”), separate from “counts”, thus allowing us to keep both. This is in contrast to the often-advised-against action of overwriting an object you are working with (which could be hard to reverse).

Note: our initial count matrix for the embryonic data did not consist of integers, indicating that the data may have already been normalized. You have been provided a rounded-data matrix (along with one imputed embryonic sample). What we are doing might thus be double-normalizing. This is ok for demonstration purposes, but will need to be addressed in real-life research situations.

Differential gene expression testing

We can now look for differentially expressed genes (DEG’s) between treated and control samples. However, if we do so with both adult and embryonic samples at once, we will likely get confusing or misleading results. Thus we can subset the data first.

Test_object <- subset(Seurat_object, subset = Age == "Adult") # subset for adult samples

Test_object$Age # see the age metadata for our samples
##     VA1     VA2     VA3     DA1     DA2     DA3 
## "Adult" "Adult" "Adult" "Adult" "Adult" "Adult"

Now we can move on to differential gene expression testing.

Idents(Test_object) <- "Treatment" # we set "Treatment" as the active identity via "Idents()", this lets Seurat know we're using the treatment information stored in the "Treatment" variable to define our comparison groups

results <- FindMarkers(Test_object, # our data
                       ident.1 = "20mgkg_at_E65to95", # what group are we looking for DEG's for
                       logfc.threshold = 1.5, # what fold-change difference minimum do we want to see between control and treated for us to consider testing it
                       min.pct = .25, # what percent of "20mgkg_at_E65to95" samples do we want to express a gene in order for us to consider testing it
                       only.pos = F, # do we want only genes that are enriched/greater in "ident.1"
                       test.use = "DESeq2") # statistical/testing method we want to use

results[1:5,] # preview our results
##              p_val avg_log2FC pct.1 pct.2    p_val_adj
## Acta1 7.781060e-34  1.2175534     1     1 1.535281e-29
## Vgll2 6.285769e-23  1.8801957     1     1 1.240245e-18
## Nppb  1.467546e-15  1.5907584     1     1 2.895615e-11
## Abra  8.596933e-15  0.5227545     1     1 1.696261e-10
## Irx2  3.118343e-13 -1.8999591     1     1 6.152802e-09

And we’re done with Seurat! In a real analysis, we would adjust the parameters of FindMarkers based on what kind of DEG’s we’re looking for, what context we’re studying, and more.

We can how take a look at the genes that turn up in our result.

results <- tibble::rownames_to_column(results) # have gene names as a formal column
sig_results <- results[results$p_val < 0.05,] # filter for "significant" results
pos_results <- sig_results[sig_results$avg_log2FC > 0,] # filter for up-regulated genes
neg_results <- sig_results[sig_results$avg_log2FC < 0,] # filter for down-regulated genes

pos_results[1:5,] # preview some enriched genes (in treated, vs control)
##   rowname        p_val avg_log2FC pct.1 pct.2    p_val_adj
## 1   Acta1 7.781060e-34  1.2175534     1     1 1.535281e-29
## 2   Vgll2 6.285769e-23  1.8801957     1     1 1.240245e-18
## 3    Nppb 1.467546e-15  1.5907584     1     1 2.895615e-11
## 4    Abra 8.596933e-15  0.5227545     1     1 1.696261e-10
## 7   Thbs4 1.517718e-12  0.6566105     1     1 2.994610e-08

We can also visualize some of the gene expression values with plots in Seurat.

VlnPlot(Test_object, features = c("Cyp2b10", "Acta1"))

Gene (set) enrichment analysis

Lastly, we would want to try to get some understanding of what our results mean in aggregate. One general way to do so is via gene set enrichment analysis.

Gene set enrichment analyses essentially estimate the probability and significance that your list of genes overlaps with a list of genes related to function, disease, etc. For our purposes, we will be using the package “clusterProfiler”.

Ref: Wu et al., 2021 : https://pubmed.ncbi.nlm.nih.gov/34557778/

geneList <- pos_results$rowname # get a vector of gene names

# we need to convert our gene names of interest to entrez IDs for use in clusterProfiler

convertedList <- AnnotationDbi::select(org.Mm.eg.db, # reference database
                        keys = geneList, # the genes we want to convert [1]
                        columns = c("ENTREZID", "SYMBOL"), # the data we use to convert
                        keytype = "SYMBOL") # what the format is we have in [1]

geneList <- convertedList$ENTREZID # get vector of entrez IDs
geneList <- geneList[!is.na(geneList)] # renove NA's from list

ego <- enrichGO(gene = geneList, # our entrez IDs of interest
                OrgDb = org.Mm.eg.db, # reference list
                ont = "MF", # what type of ontology we want (Biological Process, Cellular Component, Molecular Function)
                pAdjustMethod = "BH", # multiple-comparisons method
                pvalueCutoff  = 0.01, 
                qvalueCutoff  = 0.05,
                readable = TRUE) # main function for geneset enrichment

result_table <- ego@result

result_table[1:5,1:3] # preview our results
##                    ID                                 Description GeneRatio
## GO:0005201 GO:0005201 extracellular matrix structural constituent    19/330
## GO:0008201 GO:0008201                             heparin binding    13/330
## GO:0005539 GO:0005539                   glycosaminoglycan binding    15/330
## GO:1901681 GO:1901681                     sulfur compound binding    16/330
## GO:0050840 GO:0050840                extracellular matrix binding     8/330

And that’s it! We can save our results in a few different ways. CSV is a common format for saving tables.

write.csv(results, "results.csv")

RDS is a common format for any R object (but this would be specific to R).

saveRDS(Seurat_object, "Seurat_object.rds")