This is an R-markdown document for PCL368 - Introduction to bioinformatics module.
Let’s start by loading all the packages we will need for this session (or install first, as necessary).
#install.packages('Seurat') # as needed
library(Seurat) # for most if not all of our work with scRNAseq, only this is needed for the first lecture
#if (!requireNamespace("BiocManager", quietly = TRUE)) # as needed
# install.packages("BiocManager")
#BiocManager::install("org.Mm.eg.db") # as needed
library(org.Mm.eg.db) # gene ontology database for mouse
#BiocManager::install("clusterProfiler") # as needed
library(clusterProfiler) # gene set enrichment analysis package
Now let’s load our data, which is a combined bulk-tissue RNAseq dataset from mouse embryonic or adult ventricles after gestational exposure to TU3678.
expt_metadata <- read.csv("./Mouse_data/combined_metadata.csv", row.names=1) # our metadata
expt_matrix <- read.csv("./Mouse_data/combined_count_matrix.csv", row.names=1) # our count matrix
When starting any new project or research, it is always important to understand the contents, characteristics and distribution of our data (or the data we will be collecting, if it was a brand new experiment). First, let’s take a look at the data we are importing.
expt_matrix[1:5,] # let's see the first 5 rows and columns of our count matrix
## Gene_name VA1 VA2 VA3 DA1 DA2 DA3 VE1 VE2 DE1 DE2 DE3 DEI
## 1 0610009B22Rik 195 195 218 166 201 191 325 262 300 288 253 170
## 2 0610009E02Rik 10 7 5 1 2 4 137 20 9 6 14 75
## 3 0610009L18Rik 149 130 175 99 159 120 41 41 51 41 32 43
## 4 0610010F05Rik 123 117 106 107 98 120 499 512 508 594 512 445
## 5 0610010K14Rik 2 0 3 0 2 0 1128 882 941 944 1064 1216
We have every row representing one gene (identified by the “Gene_name” first column) and every column (except the first) representing one sample. In our case, a sample is a pooled collection of ventricle tissue. In other RNAseq experiments, a sample can be one cell, a collection of cells in one test tube, or a chunk of tissue from one animal. The numbers in the main matrix represent the number of RNA fragments, or reads, detected for one gene in one sample. For example, in the fourth column and third row, the sample named “VA3” has 175 RNA reads detected for the gene “0610009L18Rik”. This is the general format of count matrices for RNAseq data.
Now let us look at the metadata.
expt_metadata[1:5,] #first five rows of our metadata
## Sample Treatment Age Source Group
## 2 VA1 Vehicle Adult Public_Expt Adult_Vehicle
## 3 VA2 Vehicle Adult Public_Expt Adult_Vehicle
## 4 VA3 Vehicle Adult Public_Expt Adult_Vehicle
## 5 DA1 20mgkg_at_E65to95 Adult Public_Expt Adult_20mgkg_at_E65to95
## 6 DA2 20mgkg_at_E65to95 Adult Public_Expt Adult_20mgkg_at_E65to95
As you can see, the format is similar to the count matrix, but now every row is one sample and every column a metadata variable. Often times there are many samples and types of samples. We can use various functions to see the breakdown of data.
unique(expt_metadata$Treatment) #let's see what treatment groups we have
## [1] "Vehicle" "20mgkg_at_E65to95"
table(expt_metadata$Treatment, expt_metadata$Age) #let's see the sample size of our data between treatment types and age
##
## Adult Embryonic
## 20mgkg_at_E65to95 3 3
## Vehicle 3 3
We can see from the table function that there are three samples for each group between Vehicle, Treated, Adult, and Embryonic.
Now that we have an understanding of our data and metadata, we can move on to creating our Seurat object, which we will use with Seurat functions for data processing and analysis.
Seurat matches the identity of cells between our metadata and count matrix by row names and column names, respectively. So we will need to modify our dataframes a little.
row.names(expt_metadata) <- expt_metadata$Sample # set sample names to row names
expt_metadata[1:5, 1:5] # see the first 5 rows and columns of our metadata
## Sample Treatment Age Source Group
## VA1 VA1 Vehicle Adult Public_Expt Adult_Vehicle
## VA2 VA2 Vehicle Adult Public_Expt Adult_Vehicle
## VA3 VA3 Vehicle Adult Public_Expt Adult_Vehicle
## DA1 DA1 20mgkg_at_E65to95 Adult Public_Expt Adult_20mgkg_at_E65to95
## DA2 DA2 20mgkg_at_E65to95 Adult Public_Expt Adult_20mgkg_at_E65to95
All we need to do for the metadata is to set samples names (stored in expt_metadata$Sample) as the row names. Note how the row name numbers are now replaced with values from the “Sample” column.
We CAN remove the first column of sample names AFTER setting it as row names if we desire because the row names will be retained in our Seurat object. Leaving the column of sample names in the dataframe is also fine, this would just result in our Seurat object having a somewhat-redundant “Sample” column as an additional metadata column.
This is different from what we need to do with the count matrix.
row.names(expt_matrix) <- expt_matrix$Gene_name # set gene names to row names
expt_matrix <- as.matrix(expt_matrix[,-1]) # remove the first column of gene names, to leave us with a pure matrix
expt_matrix[1:5,] # see the first five rows of our transformed count matrix
## VA1 VA2 VA3 DA1 DA2 DA3 VE1 VE2 DE1 DE2 DE3 DEI
## 0610009B22Rik 195 195 218 166 201 191 325 262 300 288 253 170
## 0610009E02Rik 10 7 5 1 2 4 137 20 9 6 14 75
## 0610009L18Rik 149 130 175 99 159 120 41 41 51 41 32 43
## 0610010F05Rik 123 117 106 107 98 120 499 512 508 594 512 445
## 0610010K14Rik 2 0 3 0 2 0 1128 882 941 944 1064 1216
Our final count matrix is now a matrix object (as opposed to a dataframe), which is only possible with a purely numeric table. This is why we have to remove the “Gene_name” column (in contrast to how we can leave the Sample column in the metadata).
The sample names stored in the column names here match the sample names stored in the row names in our metadata dataframe. We can check this.
length(intersect(row.names(expt_metadata), # see how many (length function) names overlap (intersect function) between the row names (row.names function) of expt_metadata...
colnames(expt_matrix))) # ... and the column names (colnames function) of expt_matrix
## [1] 12
We can now create our Seurat object
Seurat_object <- CreateSeuratObject(count = expt_matrix,
meta.data = expt_metadata) # create our Seurat object
Seurat_object # see the number of genes (features) and cells (samples) in our object - it matches the corresponding dimensions of our initial data
## An object of class Seurat
## 19731 features across 12 samples within 1 assay
## Active assay: RNA (19731 features, 0 variable features)
Both our metadata and our count matrix information, matched by sample names, are now stored in the Seurat object.
Seurat_object@meta.data[,1:6] # see our metadata
## orig.ident nCount_RNA nFeature_RNA Sample Treatment Age
## VA1 SeuratProject 8918135 14784 VA1 Vehicle Adult
## VA2 SeuratProject 8706077 14954 VA2 Vehicle Adult
## VA3 SeuratProject 8030799 14842 VA3 Vehicle Adult
## DA1 SeuratProject 6006648 14605 DA1 20mgkg_at_E65to95 Adult
## DA2 SeuratProject 7418558 14756 DA2 20mgkg_at_E65to95 Adult
## DA3 SeuratProject 6997251 14792 DA3 20mgkg_at_E65to95 Adult
## VE1 SeuratProject 24977800 16900 VE1 Vehicle Embryonic
## VE2 SeuratProject 22332557 16900 VE2 Vehicle Embryonic
## DE1 SeuratProject 23473360 17045 DE1 20mgkg_at_E65to95 Embryonic
## DE2 SeuratProject 22869158 16958 DE2 20mgkg_at_E65to95 Embryonic
## DE3 SeuratProject 21601847 16980 DE3 20mgkg_at_E65to95 Embryonic
## DEI SeuratProject 23721425 17248 DEI Vehicle Embryonic
Seurat_object@assays$RNA@counts[1:5, ] # see the first 5 rows of our count matrix
## 5 x 12 sparse Matrix of class "dgCMatrix"
## [[ suppressing 12 column names 'VA1', 'VA2', 'VA3' ... ]]
##
## 0610009B22Rik 195 195 218 166 201 191 325 262 300 288 253 170
## 0610009E02Rik 10 7 5 1 2 4 137 20 9 6 14 75
## 0610009L18Rik 149 130 175 99 159 120 41 41 51 41 32 43
## 0610010F05Rik 123 117 106 107 98 120 499 512 508 594 512 445
## 0610010K14Rik 2 . 3 . 2 . 1128 882 941 944 1064 1216
Our data is stored in one Seurat object now. There are a few changes we should note.
Our metadata now has a few extra columns at the beginning, like n_Count_RNA and n_Feature_RNA, which represent the total number of reads and the number of detected genes in each sample. This was generated by Seurat.
Our count matrix is now stored as a sparse matrix, where 0’s are replaced with “.”. This is also done by Seurat, with the intention to save memory space.
Next class we will work on data processing and analysis within Seurat and other packages.