Introduction to bioinformatics

Setting up

This is an R-markdown document for the PCL368 - Introduction to bioinformatics module.

Let’s start by loading all the packages we will need for this session (or install first, as necessary).

#install.packages('Seurat') # as needed
library(Seurat) 
#install.packages("BiocManager")
#BiocManager::install("DESeq2") # as needed
library(DESeq2) # differential gene-expression test package; may not need to be loaded for our class, but definitely need to be installed to use its test

Note: we use Seurat for our lesson due to its ease-of-installation across different systems. Seurat is designed and used for single-cell (or single-nucleus) RNAseq analysis. Bulk RNAseq analysis is typically performed with DESeq2 directly, edgeR, or limma-voom.

Now let’s load our data, which is a combined bulk-tissue RNAseq dataset from mouse embryonic or adult ventricles after gestational exposure to TU3678.

Seurat_object <- readRDS("./Mouse_data/Seurat_object.rds")

expt_metadata <- Seurat_object@meta.data # our metadata
expt_matrix <- as.matrix(Seurat_object@assays$RNA@counts) # our count matrix, expanded

When starting any new project or research, it is always important to understand the contents, characteristics and distribution of our data. First, let’s take a look at the Seurat data structure. It is essentially a complex list with different, specifically-defined components.

Seurat_object

## An object of class Seurat 
## 19731 features across 12 samples within 1 assay 
## Active assay: RNA (19731 features, 0 variable features)
##  2 layers present: counts, data

Next let us look at the gene expression data we are working with.

expt_matrix[1:5,] # let's see the first 5 rows (genes/features) of our count matrix

##               VA1 VA2 VA3 DA1 DA2 DA3  VE1 VE2 DE1 DE2  DE3  DEI
## 0610009B22Rik 195 195 218 166 201 191  325 262 300 288  253  170
## 0610009E02Rik  10   7   5   1   2   4  137  20   9   6   14   75
## 0610009L18Rik 149 130 175  99 159 120   41  41  51  41   32   43
## 0610010F05Rik 123 117 106 107  98 120  499 512 508 594  512  445
## 0610010K14Rik   2   0   3   0   2   0 1128 882 941 944 1064 1216

Here, every row represents one gene and every column represents one sample. A sample in this case is a pooled collection of ventricle tissue. The numbers in the matrix represent the number of RNA fragments, or reads, detected for each gene in each sample. For example, the sample named “VA3” has 175 reads detected for the gene “0610009L18Rik”.

Finally, let us look at our sample metadata.

expt_metadata[1:5,] #first five rows of our metadata

##        orig.ident nCount_RNA nFeature_RNA Sample         Treatment   Age
## VA1 SeuratProject    8918135        14784    VA1           Vehicle Adult
## VA2 SeuratProject    8706077        14954    VA2           Vehicle Adult
## VA3 SeuratProject    8030799        14842    VA3           Vehicle Adult
## DA1 SeuratProject    6006648        14605    DA1 20mgkg_at_E65to95 Adult
## DA2 SeuratProject    7418558        14756    DA2 20mgkg_at_E65to95 Adult
##          Source                   Group
## VA1 Public_Expt           Adult_Vehicle
## VA2 Public_Expt           Adult_Vehicle
## VA3 Public_Expt           Adult_Vehicle
## DA1 Public_Expt Adult_20mgkg_at_E65to95
## DA2 Public_Expt Adult_20mgkg_at_E65to95

The format here is a more traditional data table, where every row is one sample and every column a metadata variable. Let us explore this data.

unique(expt_metadata$Treatment) #let's see what treatment groups we have

## [1] "Vehicle"           "20mgkg_at_E65to95"

table(expt_metadata$Treatment, expt_metadata$Age) #let's see the sample size of our data between treatment types and age

##                    
##                     Adult Embryonic
##   20mgkg_at_E65to95     3         3
##   Vehicle               3         3

We can see with the table function that there are three samples for each group between Vehicle, Treated, Adult, and Embryonic.

Differential expression testing with Seurat

Now that we have an understanding of our data and metadata, we can move on to our analysis, which we will perform with Seurat functions.

The first step is normalization, which is important to not confound increased gene expression because of biological reasons with increased RNA captured in certain samples due to technical variation.

Seurat_object <- NormalizeData(Seurat_object, 
                               normalization.method = "LogNormalize", 
                               scale.factor = 1000000) # we use a scale factor of 1 million to convert our values to counts-per-million, which is usually done with bulk RNAseq data

Seurat_object@assays$RNA@data[1:10,1:5] # note our normalized data is stored here, in "@data"

## 10 x 5 sparse Matrix of class "dgCMatrix"
##                     VA1       VA2       VA3       DA1       DA2
## 0610009B22Rik 3.1296318 3.1526569 3.3373872 3.3546663 3.3355637
## 0610009E02Rik 0.7520341 0.5900264 0.4840317 0.1539926 0.2386973
## 0610009L18Rik 2.8739899 2.7683359 3.1263704 2.8611568 3.1105213
## 0610010F05Rik 2.6940947 2.6699252 2.6531846 2.9345798 2.6539540
## 0610010K14Rik 0.2023383 .         0.3174072 .         0.2386973
## 0610012D04Rik 0.5144907 0.2067902 0.1173568 .         .        
## 0610025J13Rik 0.7520341 0.7648272 0.8627746 1.1519220 1.1495441
## 0610030E20Rik 2.6152630 2.8312030 3.0167702 2.8321711 2.8115968
## 0610039K10Rik 0.5144907 0.2960868 0.1173568 0.1539926 0.1264539
## 0610040B10Rik 1.1047134 1.3828935 1.4729436 1.6413348 1.3430577

We just converted our raw count values to counts-per-million (cpm), saved under “@data” (under “@assays” and “$RNA”), distinct from “@counts”. Running “NormalizeData()” in this case did not overwrite any of our data. However, if we were to rerun this function with different parameters, then the “@data” information would be overwritten.

Now we can look for differentially expressed genes (DEG’s) between treated and control samples. However, if we do so with both adult and embryonic samples at once, we will likely get confusing or misleading results. Thus we subset the data first.

Test_object <- subset(Seurat_object, subset = Age == "Embryonic") # subset for embryonic samples

Ideally, we would do DE testing with DESeq2 (and the functions in the DESeq2 package), but due to the robustness and ease-of-use of Seurat, we will use Seurat and its adapted DESeq2 test (if DESeq2 installed without issues) or the Wilcoxon rank sum test (whic is default in Seurat) if there were any issues with installing the DESeq2 package.

Idents(Test_object) <- "Treatment" # we set "Treatment" as the active identity via "Idents()", this lets Seurat know we're using the "Treatment" variable to define our comparison groups

results <- FindMarkers(Test_object, 
                       ident.1 = "20mgkg_at_E65to95", # what group are we looking for DEG's for
                       logfc.threshold = 1.5, # what fold-change difference minimum do we want to see between control and treated for us to consider testing it
                       min.pct = .25, # what proportion of "20mgkg_at_E65to95" samples do we want to express a gene in order for us to consider testing it
                       only.pos = F, # do we want only genes that are enriched/greater in "ident.1"
                       test.use = "DESeq2") # statistical/testing method we want to use

results[1:5,] # see first 5 results

##                p_val avg_log2FC pct.1 pct.2    p_val_adj
## Gm17494 4.902857e-18  -2.249799     1     1 9.673827e-14
## Gm13341 1.654821e-17  -1.517675     1     1 3.265127e-13
## Gm11407 1.067218e-14  -1.692259     1     1 2.105727e-10
## Gm14287 2.735318e-13  -3.537777     0     1 5.397055e-09
## Gm5940  8.937089e-13  -1.527259     1     1 1.763377e-08

We can how take a look at the genes that turn up in our result.

sig_results <- results[results$p_val < 0.05,] # filter for "significant" results
pos_results <- sig_results[sig_results$avg_log2FC > 0,] # filter for up-regulated genes

Rpl26 and Kpna2 are two highly upregulated genes in our results. Let us visualize their expression values.

VlnPlot(Test_object, features = c("Rpl26", "Kpna2"))

And that’s it! We can save our results as csv files.

write.csv(results, "./results.csv", row.names = F) # note you don't actually need the "./" part here

Introduction to bioinformatics

Sonny (Yuxiao) Chen, PhD

Mar 27, 2025

Setting up

Differential expression testing with Seurat