This is an R-markdown document for the PCL368 - Introduction to bioinformatics module.
Let’s start by loading all the packages we will need for this session (or install first, as necessary).
#install.packages('Seurat') # as needed
library(Seurat) # for most if not all of our work with RNAseq
#BiocManager::install("DESeq2") # as needed
library(DESeq2) # differential gene-expression test package; may not need to be loaded, but definitely need to be installed to use its test
Now let’s load our data, which is a combined bulk-tissue RNAseq dataset from mouse embryonic or adult ventricles after gestational exposure to TU3678.
Seurat_object <- readRDS("./Mouse_data/Seurat_object.rds")
expt_metadata <- Seurat_object@meta.data # our metadata
expt_matrix <- as.matrix(Seurat_object@assays$RNA@counts) # our count matrix, expanded
When starting any new project or research, it is always important to understand the contents, characteristics and distribution of our data (or the data we will be collecting, if it was a brand new experiment). First, let’s take a look at the Seurat data structure.
Seurat_object
## An object of class Seurat
## 19731 features across 12 samples within 1 assay
## Active assay: RNA (19731 features, 0 variable features)
Now let us look at the gene expression data we are working with.
expt_matrix[1:5,] # let's see the first 5 rows and columns of our count matrix
## VA1 VA2 VA3 DA1 DA2 DA3 VE1 VE2 DE1 DE2 DE3 DEI
## 0610009B22Rik 195 195 218 166 201 191 325 262 300 288 253 170
## 0610009E02Rik 10 7 5 1 2 4 137 20 9 6 14 75
## 0610009L18Rik 149 130 175 99 159 120 41 41 51 41 32 43
## 0610010F05Rik 123 117 106 107 98 120 499 512 508 594 512 445
## 0610010K14Rik 2 0 3 0 2 0 1128 882 941 944 1064 1216
We have every row representing one gene (identified by row names here starting with “06100”) and every column representing one sample. In our case, a sample is a pooled collection of ventricle tissue. In other RNAseq experiments, a sample can be one cell, a collection of cells in one test tube, or a chunk of tissue from one animal. The numbers in the main matrix represent the number of RNA fragments, or reads, detected for one gene in one sample. For example, the sample named “VA3” has 175 RNA reads detected for the gene “0610009L18Rik”. This is the general format of count matrices for RNAseq data.
Finally, let us look at the metadata.
expt_metadata[1:5,] #first five rows of our metadata
## orig.ident nCount_RNA nFeature_RNA Sample Treatment Age
## VA1 SeuratProject 8918135 14784 VA1 Vehicle Adult
## VA2 SeuratProject 8706077 14954 VA2 Vehicle Adult
## VA3 SeuratProject 8030799 14842 VA3 Vehicle Adult
## DA1 SeuratProject 6006648 14605 DA1 20mgkg_at_E65to95 Adult
## DA2 SeuratProject 7418558 14756 DA2 20mgkg_at_E65to95 Adult
## Source Group
## VA1 Public_Expt Adult_Vehicle
## VA2 Public_Expt Adult_Vehicle
## VA3 Public_Expt Adult_Vehicle
## DA1 Public_Expt Adult_20mgkg_at_E65to95
## DA2 Public_Expt Adult_20mgkg_at_E65to95
As you can see, the format is similar to the count matrix, but now every row is one sample and every column a metadata variable. Often times there are many samples and types of samples. We can use various functions to see the breakdown of data.
unique(expt_metadata$Treatment) #let's see what treatment groups we have
## [1] "Vehicle" "20mgkg_at_E65to95"
table(expt_metadata$Treatment, expt_metadata$Age) #let's see the sample size of our data between treatment types and age
##
## Adult Embryonic
## 20mgkg_at_E65to95 3 3
## Vehicle 3 3
We can see from the table function that there are three samples for each group between Vehicle, Treated, Adult, and Embryonic.
Now that we have an understanding of our data and metadata, we can move on to our analysis, which we will perform with Seurat functions.
The first thing people often do with RNAseq data is normalization. We need to do this too. This is important to consider as, for example, we do not want to confound increased gene expression because of biological reasons with increased RNA amplification due to technical variation.
Seurat_object <- NormalizeData(Seurat_object,
normalization.method = "LogNormalize",
scale.factor = 1000000) # we use a scale factor of 1 million to convert our values to counts-per-million
Seurat_object@assays$RNA@data[1:10,1:5] # our normalized data is stored here; preview/see our normalized data
## 10 x 5 sparse Matrix of class "dgCMatrix"
## VA1 VA2 VA3 DA1 DA2
## 0610009B22Rik 3.1296318 3.1526569 3.3373872 3.3546663 3.3355637
## 0610009E02Rik 0.7520341 0.5900264 0.4840317 0.1539926 0.2386973
## 0610009L18Rik 2.8739899 2.7683359 3.1263704 2.8611568 3.1105213
## 0610010F05Rik 2.6940947 2.6699252 2.6531846 2.9345798 2.6539540
## 0610010K14Rik 0.2023383 . 0.3174072 . 0.2386973
## 0610012D04Rik 0.5144907 0.2067902 0.1173568 . .
## 0610025J13Rik 0.7520341 0.7648272 0.8627746 1.1519220 1.1495441
## 0610030E20Rik 2.6152630 2.8312030 3.0167702 2.8321711 2.8115968
## 0610039K10Rik 0.5144907 0.2960868 0.1173568 0.1539926 0.1264539
## 0610040B10Rik 1.1047134 1.3828935 1.4729436 1.6413348 1.3430577
We just converted our raw count values to counts-per-million (cpm). To demonstrate with mock examples, gene A in sample 1 might have 30 reads (RNA fragments) aligned to it. If sample 1 has 1 million reads detected for it in total, then it would also have 30 cpm for gene A. Gene A in sample 2 might have 60 reads raw, but if 3 million reads were detected for sample 2 in total, then sample 2 would have 20 cpm for gene A. So sample 2 might have greater reads than sample 1 for gene A, but not if we normalize it.
In many Seurat functions, the output is often saved into a proper part of the Seurat object. In this case, our normalized data is saved under “data” (under “@assays” and “$RNA”), separate from “counts” , thus allowing us to keep both. This is in contrast to the often-advised-against action of overwriting an object you are working with (which could be hard to reverse).
Note: our initial count matrix for the embryonic data did not consist of integers, indicating that the data may have already been normalized. You have been provided a rounded-data matrix (along with one imputed embryonic sample). What we are doing might thus be double-normalizing. This is ok for demonstration purposes, but will definitely need to be addressed in real-life research situations.
We can now look for differentially expressed genes (DEG’s) between treated and control samples. However, if we do so with both adult and embryonic samples at once, we will likely get confusing or misleading results. Thus we subset the data first.
Test_object <- subset(Seurat_object, subset = Age == "Embryonic") # subset for adult samples
Now we can move on to differential gene expression testing. Ideally, we would do this with DESeq2 (and the functions in the DESeq2 package), but due to the robustness of Seurat, we will use Seurat and its adapted DESeq2 test (if DESeq2 installed without issues) or another test if there was any issues with the DESeq2 package.
Idents(Test_object) <- "Treatment" # we set "Treatment" as the active identity via "Idents()", this lets Seurat know we're using the treatment information stored in the "Treatment" variable to define our comparison groups
results <- FindMarkers(Test_object, # our data
ident.1 = "20mgkg_at_E65to95", # what group are we looking for DEG's for
logfc.threshold = 1.5, # what fold-change difference minimum do we want to see between control and treated for us to consider testing it
min.pct = .25, # what percent of "20mgkg_at_E65to95" samples do we want to express a gene in order for us to consider testing it
only.pos = F, # do we want only genes that are enriched/greater in "ident.1"
test.use = "DESeq2") # statistical/testing method we want to use
results[1:5,] # see first 5 results
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## Gm17494 4.902857e-18 -2.451796 1 1 9.673827e-14
## Gm13341 1.654821e-17 -1.575678 1 1 3.265127e-13
## Gm11407 1.067218e-14 -1.747157 1 1 2.105727e-10
## Gm14287 2.735318e-13 -6.409391 0 1 5.397055e-09
## Gm5940 8.937089e-13 -1.600335 1 1 1.763377e-08
And we’re done with Seurat! In a real analysis, we would adjust the parameters of FindMarkers based on what kind of DEG’s we’re looking for, what context we’re studying, and more.
We can how take a look at the genes that turn up in our result.
results <- tibble::rownames_to_column(results) # have gene names as a formal column
sig_results <- results[results$p_val < 0.05,] # filter for "significant" results
pos_results <- sig_results[sig_results$avg_log2FC > 0,] # filter for up-regulated genes
We can also visualize some of the gene expression values with plots in Seurat.
VlnPlot(Test_object, features = c("Rpl26", "Kpna2"))
And that’s it! We can save our results in a few different ways. CSV is a common format for saving tables.
# write.csv(results, "./results.csv", row.names = F) # note you don't actually need the "./" part here