Introduciton to bioinformatics (p2)

Setting up

This is an R-markdown document for PCL1491 - Introduction to bioinformatics (part 2).

Let’s start by loading all the packages we will need for this session.

library(Seurat) # for most of our work with RNAseq
# install.packages("BiocManager")
# BiocManager::install("org.Mm.eg.db") # as needed
library(org.Mm.eg.db) # gene ontology database for mouse
# BiocManager::install("clusterProfiler") # as needed
library(clusterProfiler) # gene set enrichment analysis package
#BiocManager::install("DESeq2") # as needed
# library(DESeq2) # differential gene-expression test package; may not need to be loaded, but definitely need to be installed to use its test

We ended last time after creating our Seurat object.

Seurat_object <- readRDS("./Seurat_object.rds") # as appropriate
Seurat_object # see the number of genes (features) and cells (samples) in our object - it matches the corresponding dimensions of our initial data

## An object of class Seurat 
## 19731 features across 12 samples within 1 assay 
## Active assay: RNA (19731 features, 0 variable features)

Recall that both our metadata and our count matrix information, matched by sample names, are stored in the Seurat object.

The first thing people often do with RNAseq data is normalization. We need to do this too. This is important to consider as, for example, we do not want to confound increased gene expression because of biological reasons with increased RNA amplification due to technical variation.

Seurat_object <- NormalizeData(Seurat_object, 
                               normalization.method = "LogNormalize", 
                               scale.factor = 1000000) # we use a scale factor of 1 million to convert our values to counts-per-million

Seurat_object@assays$RNA@data[1:10,1:5] # our normalized data is stored here; preview/see our normalized data

## 10 x 5 sparse Matrix of class "dgCMatrix"
##                     VA1       VA2       VA3       DA1       DA2
## 0610009B22Rik 3.1296318 3.1526569 3.3373872 3.3546663 3.3355637
## 0610009E02Rik 0.7520341 0.5900264 0.4840317 0.1539926 0.2386973
## 0610009L18Rik 2.8739899 2.7683359 3.1263704 2.8611568 3.1105213
## 0610010F05Rik 2.6940947 2.6699252 2.6531846 2.9345798 2.6539540
## 0610010K14Rik 0.2023383 .         0.3174072 .         0.2386973
## 0610012D04Rik 0.5144907 0.2067902 0.1173568 .         .        
## 0610025J13Rik 0.7520341 0.7648272 0.8627746 1.1519220 1.1495441
## 0610030E20Rik 2.6152630 2.8312030 3.0167702 2.8321711 2.8115968
## 0610039K10Rik 0.5144907 0.2960868 0.1173568 0.1539926 0.1264539
## 0610040B10Rik 1.1047134 1.3828935 1.4729436 1.6413348 1.3430577

This code/function/action converts our raw count values to counts-per-million (cpm). To demonstrate with mock examples, gene A in sample 1 might have 30 reads (RNA fragments) aligned to it. If sample 1 has 1 million reads detected for it in total, then it would also have 30 cpm for gene A. Gene A in sample 2 might have 60 reads raw, but if 3 million reads were detected for sample 2 in total, then sample 2 would have 20 cpm for gene A. So sample 2 might have greater reads than sample 1 for gene A, but not if we normalize it.

In many Seurat functions, the output is often saved into a proper part of the Seurat object. In this case, our normalized data is saved under “data” (under “@assays” and “$RNA”), separate from “counts”, thus allowing us to keep both. This is in contrast to the often-advised-against action of overwriting an object you are working with (which could be hard to reverse).

Note: our initial count matrix for the embryonic data did not consist of integers, indicating that the data may have already been normalized. You have been provided a rounded-data matrix (along with one imputed embryonic sample). What we are doing might thus be double-normalizing. This is ok for demonstration purposes, but will need to be addressed in real-life research situations.

Differential gene expression testing

We can now look for differentially expressed genes (DEG’s) between treated and control samples. However, if we do so with both adult and embryonic samples at once, we will likely get confusing or misleading results. Thus we can subset the data first.

Test_object <- subset(Seurat_object, subset = Age == "Embryonic") # subset for adult samples

Now we can move on to differential gene expression testing.

Idents(Test_object) <- "Treatment" # we set "Treatment" as the active identity via "Idents()", this lets Seurat know we're using the treatment information stored in the "Treatment" variable to define our comparison groups

results <- FindMarkers(Test_object, # our data
                       ident.1 = "20mgkg_at_E65to95", # what group are we looking for DEG's for
                       logfc.threshold = 1.5, # what fold-change difference minimum do we want to see between control and treated for us to consider testing it
                       min.pct = .25, # what percent of "20mgkg_at_E65to95" samples do we want to express a gene in order for us to consider testing it
                       only.pos = F, # do we want only genes that are enriched/greater in "ident.1"
                       test.use = "DESeq2") # statistical/testing method we want to use

results[1:5,] # see first 5 results

##                p_val avg_log2FC pct.1 pct.2    p_val_adj
## Gm17494 4.902975e-18  -2.451796     1     1 9.674060e-14
## Gm13341 1.654806e-17  -1.575678     1     1 3.265098e-13
## Gm11407 1.067264e-14  -1.747157     1     1 2.105818e-10
## Gm14287 2.735312e-13  -6.409391     0     1 5.397043e-09
## Gm5940  8.937170e-13  -1.600335     1     1 1.763393e-08

And we’re done with Seurat! In a real analysis, we would adjust the parameters of FindMarkers based on what kind of DEG’s we’re looking for, what context we’re studying, and more.

We can how take a look at the genes that turn up in our result.

results <- tibble::rownames_to_column(results) # have gene names as a formal column
sig_results <- results[results$p_val < 0.05,] # filter for "significant" results
pos_results <- sig_results[sig_results$avg_log2FC > 0,] # filter for up-regulated genes

We can also visualize some of the gene expression values with plots in Seurat.

VlnPlot(Test_object, features = c("Rpl26", "Kpna2"))

Gene (set) enrichment analysis

Lastly, we would want to try to get some understanding of what our results mean in aggregate. One general way to do so is via gene set enrichment analysis.

Gene set enrichment analyses essentially estimate the probability and significance that your list of genes overlaps with a list of genes related to function, disease, etc. For our purposes, we will be using the package “clusterProfiler”.

Ref: Wu et al., 2021 : https://pubmed.ncbi.nlm.nih.gov/34557778/

geneList <- pos_results$rowname # get a vector of gene names

# we need to convert our gene names of interest to entrez IDs for use in clusterProfiler

convertedList <- AnnotationDbi::select(org.Mm.eg.db, # reference database
                        keys = geneList, # the genes we want to convert [1]
                        columns = c("ENTREZID", "SYMBOL"), # the data we use to convert
                        keytype = "SYMBOL") # what the format is we have in [1]

## 'select()' returned 1:1 mapping between keys and columns

geneList <- convertedList$ENTREZID # get vector of entrez IDs
geneList <- geneList[!is.na(geneList)] # renove NA's from list

ego <- enrichGO(gene = geneList, # our entrez IDs of interest
                OrgDb = org.Mm.eg.db, # reference list
                ont = "CC", # what type of ontology we want (Biological Process, Cellular Component, Molecular Function)
                pAdjustMethod = "BH", # multiple-comparisons method
                pvalueCutoff  = 0.01, 
                qvalueCutoff  = 0.05,
                readable = TRUE) # main function for geneset enrichment

result_table <- ego@result

result_table[1:5,1:4] # see first 5 results (first 4 columns)

##                    ID                          Description GeneRatio   BgRatio
## GO:0032279 GO:0032279                   asymmetric synapse    25/680 405/23271
## GO:0099634 GO:0099634 postsynaptic specialization membrane    12/680 136/23271
## GO:0097060 GO:0097060                    synaptic membrane    27/680 470/23271
## GO:0014069 GO:0014069                 postsynaptic density    24/680 400/23271
## GO:0034703 GO:0034703               cation channel complex    16/680 222/23271

And that’s it! We can save our results in a few different ways. CSV is a common format for saving tables.

write.csv(results, "./results.csv") # note you don't actually need the "./" part here

Introduciton to bioinformatics (p2)

Dr. Yuxiao (Sonny) Chen

Jun 23, 2022

Setting up

Differential gene expression testing

Gene (set) enrichment analysis