Introduciton to bioinformatics

Setting up

This is an R-markdown document for the PCL368 - Introduction to bioinformatics module.

Let’s start by loading all the packages we will need for this session (or install first, as necessary).

#install.packages('Seurat') # as needed
library(Seurat) # for most if not all of our work with RNAseq
#BiocManager::install("DESeq2") # as needed
library(DESeq2) # differential gene-expression test package; may not need to be loaded, but definitely need to be installed to use its test

Now let’s load our data, which is a combined bulk-tissue RNAseq dataset from mouse embryonic or adult ventricles after gestational exposure to TU3678.

Seurat_object <- readRDS("./Mouse_data/Seurat_object.rds")

expt_metadata <- Seurat_object@meta.data # our metadata
expt_matrix <- as.matrix(Seurat_object@assays$RNA@counts) # our count matrix, expanded

When starting any new project or research, it is always important to understand the contents, characteristics and distribution of our data (or the data we will be collecting, if it was a brand new experiment). First, let’s take a look at the Seurat data structure.

Seurat_object

## An object of class Seurat 
## 19731 features across 12 samples within 1 assay 
## Active assay: RNA (19731 features, 0 variable features)

Now let us look at the gene expression data we are working with.

expt_matrix[1:5,] # let's see the first 5 rows and columns of our count matrix

##               VA1 VA2 VA3 DA1 DA2 DA3  VE1 VE2 DE1 DE2  DE3  DEI
## 0610009B22Rik 195 195 218 166 201 191  325 262 300 288  253  170
## 0610009E02Rik  10   7   5   1   2   4  137  20   9   6   14   75
## 0610009L18Rik 149 130 175  99 159 120   41  41  51  41   32   43
## 0610010F05Rik 123 117 106 107  98 120  499 512 508 594  512  445
## 0610010K14Rik   2   0   3   0   2   0 1128 882 941 944 1064 1216

We have every row representing one gene (identified by row names here starting with “06100”) and every column representing one sample. In our case, a sample is a pooled collection of ventricle tissue. In other RNAseq experiments, a sample can be one cell, a collection of cells in one test tube, or a chunk of tissue from one animal. The numbers in the main matrix represent the number of RNA fragments, or reads, detected for one gene in one sample. For example, the sample named “VA3” has 175 RNA reads detected for the gene “0610009L18Rik”. This is the general format of count matrices for RNAseq data.

Finally, let us look at the metadata.

expt_metadata[1:5,] #first five rows of our metadata

##        orig.ident nCount_RNA nFeature_RNA Sample         Treatment   Age
## VA1 SeuratProject    8918135        14784    VA1           Vehicle Adult
## VA2 SeuratProject    8706077        14954    VA2           Vehicle Adult
## VA3 SeuratProject    8030799        14842    VA3           Vehicle Adult
## DA1 SeuratProject    6006648        14605    DA1 20mgkg_at_E65to95 Adult
## DA2 SeuratProject    7418558        14756    DA2 20mgkg_at_E65to95 Adult
##          Source                   Group
## VA1 Public_Expt           Adult_Vehicle
## VA2 Public_Expt           Adult_Vehicle
## VA3 Public_Expt           Adult_Vehicle
## DA1 Public_Expt Adult_20mgkg_at_E65to95
## DA2 Public_Expt Adult_20mgkg_at_E65to95

As you can see, the format is similar to the count matrix, but now every row is one sample and every column a metadata variable. Often times there are many samples and types of samples. We can use various functions to see the breakdown of data.

unique(expt_metadata$Treatment) #let's see what treatment groups we have

## [1] "Vehicle"           "20mgkg_at_E65to95"

table(expt_metadata$Treatment, expt_metadata$Age) #let's see the sample size of our data between treatment types and age

##                    
##                     Adult Embryonic
##   20mgkg_at_E65to95     3         3
##   Vehicle               3         3

We can see from the table function that there are three samples for each group between Vehicle, Treated, Adult, and Embryonic.

Analysis with Seurat

Now that we have an understanding of our data and metadata, we can move on to our analysis, which we will perform with Seurat functions.

The first thing people often do with RNAseq data is normalization. We need to do this too. This is important to consider as, for example, we do not want to confound increased gene expression because of biological reasons with increased RNA amplification due to technical variation.

Seurat_object <- NormalizeData(Seurat_object, 
                               normalization.method = "LogNormalize", 
                               scale.factor = 1000000) # we use a scale factor of 1 million to convert our values to counts-per-million

Seurat_object@assays$RNA@data[1:10,1:5] # our normalized data is stored here; preview/see our normalized data

## 10 x 5 sparse Matrix of class "dgCMatrix"
##                     VA1       VA2       VA3       DA1       DA2
## 0610009B22Rik 3.1296318 3.1526569 3.3373872 3.3546663 3.3355637
## 0610009E02Rik 0.7520341 0.5900264 0.4840317 0.1539926 0.2386973
## 0610009L18Rik 2.8739899 2.7683359 3.1263704 2.8611568 3.1105213
## 0610010F05Rik 2.6940947 2.6699252 2.6531846 2.9345798 2.6539540
## 0610010K14Rik 0.2023383 .         0.3174072 .         0.2386973
## 0610012D04Rik 0.5144907 0.2067902 0.1173568 .         .        
## 0610025J13Rik 0.7520341 0.7648272 0.8627746 1.1519220 1.1495441
## 0610030E20Rik 2.6152630 2.8312030 3.0167702 2.8321711 2.8115968
## 0610039K10Rik 0.5144907 0.2960868 0.1173568 0.1539926 0.1264539
## 0610040B10Rik 1.1047134 1.3828935 1.4729436 1.6413348 1.3430577

We just converted our raw count values to counts-per-million (cpm). To demonstrate with mock examples, gene A in sample 1 might have 30 reads (RNA fragments) aligned to it. If sample 1 has 1 million reads detected for it in total, then it would also have 30 cpm for gene A. Gene A in sample 2 might have 60 reads raw, but if 3 million reads were detected for sample 2 in total, then sample 2 would have 20 cpm for gene A. So sample 2 might have greater reads than sample 1 for gene A, but not if we normalize it.

In many Seurat functions, the output is often saved into a proper part of the Seurat object. In this case, our normalized data is saved under “data” (under “@assays” and “$RNA”), separate from “counts” , thus allowing us to keep both. This is in contrast to the often-advised-against action of overwriting an object you are working with (which could be hard to reverse).

Note: our initial count matrix for the embryonic data did not consist of integers, indicating that the data may have already been normalized. You have been provided a rounded-data matrix (along with one imputed embryonic sample). What we are doing might thus be double-normalizing. This is ok for demonstration purposes, but will definitely need to be addressed in real-life research situations.

We can now look for differentially expressed genes (DEG’s) between treated and control samples. However, if we do so with both adult and embryonic samples at once, we will likely get confusing or misleading results. Thus we subset the data first.

Test_object <- subset(Seurat_object, subset = Age == "Embryonic") # subset for adult samples

Now we can move on to differential gene expression testing. Ideally, we would do this with DESeq2 (and the functions in the DESeq2 package), but due to the robustness of Seurat, we will use Seurat and its adapted DESeq2 test (if DESeq2 installed without issues) or another test if there was any issues with the DESeq2 package.

Idents(Test_object) <- "Treatment" # we set "Treatment" as the active identity via "Idents()", this lets Seurat know we're using the treatment information stored in the "Treatment" variable to define our comparison groups

results <- FindMarkers(Test_object, # our data
                       ident.1 = "20mgkg_at_E65to95", # what group are we looking for DEG's for
                       logfc.threshold = 1.5, # what fold-change difference minimum do we want to see between control and treated for us to consider testing it
                       min.pct = .25, # what percent of "20mgkg_at_E65to95" samples do we want to express a gene in order for us to consider testing it
                       only.pos = F, # do we want only genes that are enriched/greater in "ident.1"
                       test.use = "DESeq2") # statistical/testing method we want to use

results[1:5,] # see first 5 results

##                p_val avg_log2FC pct.1 pct.2    p_val_adj
## Gm17494 4.902857e-18  -2.451796     1     1 9.673827e-14
## Gm13341 1.654821e-17  -1.575678     1     1 3.265127e-13
## Gm11407 1.067218e-14  -1.747157     1     1 2.105727e-10
## Gm14287 2.735318e-13  -6.409391     0     1 5.397055e-09
## Gm5940  8.937089e-13  -1.600335     1     1 1.763377e-08

And we’re done with Seurat! In a real analysis, we would adjust the parameters of FindMarkers based on what kind of DEG’s we’re looking for, what context we’re studying, and more.

We can how take a look at the genes that turn up in our result.

results <- tibble::rownames_to_column(results) # have gene names as a formal column
sig_results <- results[results$p_val < 0.05,] # filter for "significant" results
pos_results <- sig_results[sig_results$avg_log2FC > 0,] # filter for up-regulated genes

We can also visualize some of the gene expression values with plots in Seurat.

VlnPlot(Test_object, features = c("Rpl26", "Kpna2"))

And that’s it! We can save our results in a few different ways. CSV is a common format for saving tables.

# write.csv(results, "./results.csv", row.names = F) # note you don't actually need the "./" part here

Introduciton to bioinformatics

Dr. Yuxiao (Sonny) Chen

Mar 30, 2023

Setting up

Analysis with Seurat