This is an R-markdown document for the PCL368 - Introduction to bioinformatics module.
Let’s start by loading all the packages we will need for this session (or install first, as necessary).
#install.packages('Seurat') # as needed
library(Seurat)
#install.packages("BiocManager")
#BiocManager::install("DESeq2") # as needed
library(DESeq2) # differential gene-expression test package; may not need to be loaded for our class, but definitely need to be installed to use its test
Note: we use Seurat for our lesson due to its ease-of-installation across different systems. Seurat is designed and used for single-cell (or single-nucleus) RNAseq analysis. Bulk RNAseq analysis is typically performed with DESeq2 directly, edgeR, or limma-voom.
Now let’s load our data, which is a combined bulk-tissue RNAseq dataset from mouse embryonic or adult ventricles after gestational exposure to TU3678.
Seurat_object <- readRDS("./Mouse_data/Seurat_object.rds")
expt_metadata <- Seurat_object@meta.data # our metadata
expt_matrix <- as.matrix(Seurat_object@assays$RNA@counts) # our count matrix, expanded
When starting any new project or research, it is always important to understand the contents, characteristics and distribution of our data. First, let’s take a look at the Seurat data structure. It is essentially a complex list with different, specifically-defined components.
Seurat_object
## An object of class Seurat
## 19731 features across 12 samples within 1 assay
## Active assay: RNA (19731 features, 0 variable features)
## 2 layers present: counts, data
Next let us look at the gene expression data we are working with.
expt_matrix[1:5,] # let's see the first 5 rows (genes/features) of our count matrix
## VA1 VA2 VA3 DA1 DA2 DA3 VE1 VE2 DE1 DE2 DE3 DEI
## 0610009B22Rik 195 195 218 166 201 191 325 262 300 288 253 170
## 0610009E02Rik 10 7 5 1 2 4 137 20 9 6 14 75
## 0610009L18Rik 149 130 175 99 159 120 41 41 51 41 32 43
## 0610010F05Rik 123 117 106 107 98 120 499 512 508 594 512 445
## 0610010K14Rik 2 0 3 0 2 0 1128 882 941 944 1064 1216
Here, every row represents one gene and every column represents one sample. A sample in this case is a pooled collection of ventricle tissue. The numbers in the matrix represent the number of RNA fragments, or reads, detected for each gene in each sample. For example, the sample named “VA3” has 175 reads detected for the gene “0610009L18Rik”.
Finally, let us look at our sample metadata.
expt_metadata[1:5,] #first five rows of our metadata
## orig.ident nCount_RNA nFeature_RNA Sample Treatment Age
## VA1 SeuratProject 8918135 14784 VA1 Vehicle Adult
## VA2 SeuratProject 8706077 14954 VA2 Vehicle Adult
## VA3 SeuratProject 8030799 14842 VA3 Vehicle Adult
## DA1 SeuratProject 6006648 14605 DA1 20mgkg_at_E65to95 Adult
## DA2 SeuratProject 7418558 14756 DA2 20mgkg_at_E65to95 Adult
## Source Group
## VA1 Public_Expt Adult_Vehicle
## VA2 Public_Expt Adult_Vehicle
## VA3 Public_Expt Adult_Vehicle
## DA1 Public_Expt Adult_20mgkg_at_E65to95
## DA2 Public_Expt Adult_20mgkg_at_E65to95
The format here is a more traditional data table, where every row is one sample and every column a metadata variable. Let us explore this data.
unique(expt_metadata$Treatment) #let's see what treatment groups we have
## [1] "Vehicle" "20mgkg_at_E65to95"
table(expt_metadata$Treatment, expt_metadata$Age) #let's see the sample size of our data between treatment types and age
##
## Adult Embryonic
## 20mgkg_at_E65to95 3 3
## Vehicle 3 3
We can see with the table function that there are three samples for each group between Vehicle, Treated, Adult, and Embryonic.
Now that we have an understanding of our data and metadata, we can move on to our analysis, which we will perform with Seurat functions.
The first step is normalization, which is important to not confound increased gene expression because of biological reasons with increased RNA captured in certain samples due to technical variation.
Seurat_object <- NormalizeData(Seurat_object,
normalization.method = "LogNormalize",
scale.factor = 1000000) # we use a scale factor of 1 million to convert our values to counts-per-million, which is usually done with bulk RNAseq data
Seurat_object@assays$RNA@data[1:10,1:5] # note our normalized data is stored here, in "@data"
## 10 x 5 sparse Matrix of class "dgCMatrix"
## VA1 VA2 VA3 DA1 DA2
## 0610009B22Rik 3.1296318 3.1526569 3.3373872 3.3546663 3.3355637
## 0610009E02Rik 0.7520341 0.5900264 0.4840317 0.1539926 0.2386973
## 0610009L18Rik 2.8739899 2.7683359 3.1263704 2.8611568 3.1105213
## 0610010F05Rik 2.6940947 2.6699252 2.6531846 2.9345798 2.6539540
## 0610010K14Rik 0.2023383 . 0.3174072 . 0.2386973
## 0610012D04Rik 0.5144907 0.2067902 0.1173568 . .
## 0610025J13Rik 0.7520341 0.7648272 0.8627746 1.1519220 1.1495441
## 0610030E20Rik 2.6152630 2.8312030 3.0167702 2.8321711 2.8115968
## 0610039K10Rik 0.5144907 0.2960868 0.1173568 0.1539926 0.1264539
## 0610040B10Rik 1.1047134 1.3828935 1.4729436 1.6413348 1.3430577
We just converted our raw count values to counts-per-million (cpm), saved under “@data” (under “@assays” and “$RNA”), distinct from “@counts”. Running “NormalizeData()” in this case did not overwrite any of our data. However, if we were to rerun this function with different parameters, then the “@data” information would be overwritten.
Now we can look for differentially expressed genes (DEG’s) between treated and control samples. However, if we do so with both adult and embryonic samples at once, we will likely get confusing or misleading results. Thus we subset the data first.
Test_object <- subset(Seurat_object, subset = Age == "Embryonic") # subset for embryonic samples
Ideally, we would do DE testing with DESeq2 (and the functions in the DESeq2 package), but due to the robustness and ease-of-use of Seurat, we will use Seurat and its adapted DESeq2 test (if DESeq2 installed without issues) or the Wilcoxon rank sum test (whic is default in Seurat) if there were any issues with installing the DESeq2 package.
Idents(Test_object) <- "Treatment" # we set "Treatment" as the active identity via "Idents()", this lets Seurat know we're using the "Treatment" variable to define our comparison groups
results <- FindMarkers(Test_object,
ident.1 = "20mgkg_at_E65to95", # what group are we looking for DEG's for
logfc.threshold = 1.5, # what fold-change difference minimum do we want to see between control and treated for us to consider testing it
min.pct = .25, # what proportion of "20mgkg_at_E65to95" samples do we want to express a gene in order for us to consider testing it
only.pos = F, # do we want only genes that are enriched/greater in "ident.1"
test.use = "DESeq2") # statistical/testing method we want to use
results[1:5,] # see first 5 results
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## Gm17494 4.902857e-18 -2.249799 1 1 9.673827e-14
## Gm13341 1.654821e-17 -1.517675 1 1 3.265127e-13
## Gm11407 1.067218e-14 -1.692259 1 1 2.105727e-10
## Gm14287 2.735318e-13 -3.537777 0 1 5.397055e-09
## Gm5940 8.937089e-13 -1.527259 1 1 1.763377e-08
We can how take a look at the genes that turn up in our result.
sig_results <- results[results$p_val < 0.05,] # filter for "significant" results
pos_results <- sig_results[sig_results$avg_log2FC > 0,] # filter for up-regulated genes
Rpl26 and Kpna2 are two highly upregulated genes in our results. Let us visualize their expression values.
VlnPlot(Test_object, features = c("Rpl26", "Kpna2"))
And that’s it! We can save our results as csv files.
write.csv(results, "./results.csv", row.names = F) # note you don't actually need the "./" part here