Introduciton to bioinformatics

Setting up

This is an R-markdown document for PCL368 - Introduction to bioinformatics module.

Let’s start by loading all the packages we will need for this session (or install first, as necessary).

#install.packages('Seurat') # as needed
library(Seurat) # for most if not all of our work with scRNAseq, only this is needed for the first lecture
#if (!requireNamespace("BiocManager", quietly = TRUE)) # as needed
#    install.packages("BiocManager")
#BiocManager::install("org.Mm.eg.db") # as needed
library(org.Mm.eg.db) # gene ontology database for mouse
#BiocManager::install("clusterProfiler") # as needed
library(clusterProfiler) # gene set enrichment analysis package

Now let’s load our data, which is a combined bulk-tissue RNAseq dataset from mouse embryonic or adult ventricles after gestational exposure to TU3678.

expt_metadata <- read.csv("./Mouse_data/combined_metadata.csv", row.names=1) # our metadata
expt_matrix <- read.csv("./Mouse_data/combined_count_matrix.csv", row.names=1) # our count matrix

When starting any new project or research, it is always important to understand the contents, characteristics and distribution of our data (or the data we will be collecting, if it was a brand new experiment). First, let’s take a look at the data we are importing.

expt_matrix[1:5,] # let's see the first 5 rows and columns of our count matrix

##       Gene_name VA1 VA2 VA3 DA1 DA2 DA3  VE1 VE2 DE1 DE2  DE3  DEI
## 1 0610009B22Rik 195 195 218 166 201 191  325 262 300 288  253  170
## 2 0610009E02Rik  10   7   5   1   2   4  137  20   9   6   14   75
## 3 0610009L18Rik 149 130 175  99 159 120   41  41  51  41   32   43
## 4 0610010F05Rik 123 117 106 107  98 120  499 512 508 594  512  445
## 5 0610010K14Rik   2   0   3   0   2   0 1128 882 941 944 1064 1216

We have every row representing one gene (identified by the “Gene_name” first column) and every column (except the first) representing one sample. In our case, a sample is a pooled collection of ventricle tissue. In other RNAseq experiments, a sample can be one cell, a collection of cells in one test tube, or a chunk of tissue from one animal. The numbers in the main matrix represent the number of RNA fragments, or reads, detected for one gene in one sample. For example, in the fourth column and third row, the sample named “VA3” has 175 RNA reads detected for the gene “0610009L18Rik”. This is the general format of count matrices for RNAseq data.

Now let us look at the metadata.

expt_metadata[1:5,] #first five rows of our metadata

##   Sample         Treatment   Age      Source                   Group
## 2    VA1           Vehicle Adult Public_Expt           Adult_Vehicle
## 3    VA2           Vehicle Adult Public_Expt           Adult_Vehicle
## 4    VA3           Vehicle Adult Public_Expt           Adult_Vehicle
## 5    DA1 20mgkg_at_E65to95 Adult Public_Expt Adult_20mgkg_at_E65to95
## 6    DA2 20mgkg_at_E65to95 Adult Public_Expt Adult_20mgkg_at_E65to95

As you can see, the format is similar to the count matrix, but now every row is one sample and every column a metadata variable. Often times there are many samples and types of samples. We can use various functions to see the breakdown of data.

unique(expt_metadata$Treatment) #let's see what treatment groups we have

## [1] "Vehicle"           "20mgkg_at_E65to95"

table(expt_metadata$Treatment, expt_metadata$Age) #let's see the sample size of our data between treatment types and age

##                    
##                     Adult Embryonic
##   20mgkg_at_E65to95     3         3
##   Vehicle               3         3

We can see from the table function that there are three samples for each group between Vehicle, Treated, Adult, and Embryonic.

Starting with Seurat

Now that we have an understanding of our data and metadata, we can move on to creating our Seurat object, which we will use with Seurat functions for data processing and analysis.

Seurat matches the identity of cells between our metadata and count matrix by row names and column names, respectively. So we will need to modify our dataframes a little.

row.names(expt_metadata) <- expt_metadata$Sample # set sample names to row names
expt_metadata[1:5, 1:5] # see the first 5 rows and columns of our metadata

##     Sample         Treatment   Age      Source                   Group
## VA1    VA1           Vehicle Adult Public_Expt           Adult_Vehicle
## VA2    VA2           Vehicle Adult Public_Expt           Adult_Vehicle
## VA3    VA3           Vehicle Adult Public_Expt           Adult_Vehicle
## DA1    DA1 20mgkg_at_E65to95 Adult Public_Expt Adult_20mgkg_at_E65to95
## DA2    DA2 20mgkg_at_E65to95 Adult Public_Expt Adult_20mgkg_at_E65to95

All we need to do for the metadata is to set samples names (stored in expt_metadata$Sample) as the row names. Note how the row name numbers are now replaced with values from the “Sample” column.

We CAN remove the first column of sample names AFTER setting it as row names if we desire because the row names will be retained in our Seurat object. Leaving the column of sample names in the dataframe is also fine, this would just result in our Seurat object having a somewhat-redundant “Sample” column as an additional metadata column.

This is different from what we need to do with the count matrix.

row.names(expt_matrix) <- expt_matrix$Gene_name # set gene names to row names
expt_matrix <- as.matrix(expt_matrix[,-1]) # remove the first column of gene names, to leave us with a pure matrix
expt_matrix[1:5,] # see the first five rows of our transformed count matrix

##               VA1 VA2 VA3 DA1 DA2 DA3  VE1 VE2 DE1 DE2  DE3  DEI
## 0610009B22Rik 195 195 218 166 201 191  325 262 300 288  253  170
## 0610009E02Rik  10   7   5   1   2   4  137  20   9   6   14   75
## 0610009L18Rik 149 130 175  99 159 120   41  41  51  41   32   43
## 0610010F05Rik 123 117 106 107  98 120  499 512 508 594  512  445
## 0610010K14Rik   2   0   3   0   2   0 1128 882 941 944 1064 1216

Our final count matrix is now a matrix object (as opposed to a dataframe), which is only possible with a purely numeric table. This is why we have to remove the “Gene_name” column (in contrast to how we can leave the Sample column in the metadata).

The sample names stored in the column names here match the sample names stored in the row names in our metadata dataframe. We can check this.

length(intersect(row.names(expt_metadata), # see how many (length function) names overlap (intersect function) between the row names (row.names function) of expt_metadata...
                 colnames(expt_matrix))) # ... and the column names (colnames function) of expt_matrix

## [1] 12

We can now create our Seurat object

Seurat_object <- CreateSeuratObject(count = expt_matrix, 
                                    meta.data = expt_metadata) # create our Seurat object

Seurat_object # see the number of genes (features) and cells (samples) in our object - it matches the corresponding dimensions of our initial data

## An object of class Seurat 
## 19731 features across 12 samples within 1 assay 
## Active assay: RNA (19731 features, 0 variable features)

Both our metadata and our count matrix information, matched by sample names, are now stored in the Seurat object.

Seurat_object@meta.data[,1:6] # see our metadata

##        orig.ident nCount_RNA nFeature_RNA Sample         Treatment       Age
## VA1 SeuratProject    8918135        14784    VA1           Vehicle     Adult
## VA2 SeuratProject    8706077        14954    VA2           Vehicle     Adult
## VA3 SeuratProject    8030799        14842    VA3           Vehicle     Adult
## DA1 SeuratProject    6006648        14605    DA1 20mgkg_at_E65to95     Adult
## DA2 SeuratProject    7418558        14756    DA2 20mgkg_at_E65to95     Adult
## DA3 SeuratProject    6997251        14792    DA3 20mgkg_at_E65to95     Adult
## VE1 SeuratProject   24977800        16900    VE1           Vehicle Embryonic
## VE2 SeuratProject   22332557        16900    VE2           Vehicle Embryonic
## DE1 SeuratProject   23473360        17045    DE1 20mgkg_at_E65to95 Embryonic
## DE2 SeuratProject   22869158        16958    DE2 20mgkg_at_E65to95 Embryonic
## DE3 SeuratProject   21601847        16980    DE3 20mgkg_at_E65to95 Embryonic
## DEI SeuratProject   23721425        17248    DEI           Vehicle Embryonic

Seurat_object@assays$RNA@counts[1:5, ] # see the first 5 rows of our count matrix

## 5 x 12 sparse Matrix of class "dgCMatrix"

##    [[ suppressing 12 column names 'VA1', 'VA2', 'VA3' ... ]]

##                                                                 
## 0610009B22Rik 195 195 218 166 201 191  325 262 300 288  253  170
## 0610009E02Rik  10   7   5   1   2   4  137  20   9   6   14   75
## 0610009L18Rik 149 130 175  99 159 120   41  41  51  41   32   43
## 0610010F05Rik 123 117 106 107  98 120  499 512 508 594  512  445
## 0610010K14Rik   2   .   3   .   2   . 1128 882 941 944 1064 1216

Our data is stored in one Seurat object now. There are a few changes we should note.

Our metadata now has a few extra columns at the beginning, like n_Count_RNA and n_Feature_RNA, which represent the total number of reads and the number of detected genes in each sample. This was generated by Seurat.

Our count matrix is now stored as a sparse matrix, where 0’s are replaced with “.”. This is also done by Seurat, with the intention to save memory space.

Next class we will work on data processing and analysis within Seurat and other packages.

Introduciton to bioinformatics

Dr. Yuxiao (Sonny) Chen

Feb 03, 2022

Setting up

Starting with Seurat