This is a little mini summer project that I put together to highlight some skills I’ve learned as a beginning student of bioinformatics. In the past, I’ve used limma for very basic DE analysis on miRNA, but this is the first time that I am using Rstudio/markdown to perform DE analysis using DESeq2. I want to give thanks to Joshua Vandenbrink; he runs a youtube channel that contains a lot of R/computational biology tutorials, and I would not have been able to perform this analysis without his guidance. Sorry if the whole thing is a little rough around the edges, but this is more of a trial run than anything.
The data used in this project is mouse RNA-seq reads from Fu et al. 2015. Both the raw data and counts can be downloaded from thd GEO database under accession number GSE60450.
In the study where the data was pulled from, the expression patterns of basal and luminal cells in the mammary gland were investigated across different stages of mouse reproductive cycle. The experiment included six groups, and two biological replicates were utilized, involving independent sorting of cells from mammary glands of virgin, pregnant, and lactating mice.
For the purpouse of my analysis, I dont really care about replicating anyones results or doing anything too ground breaking. In theory, Basal and Luminal cells should have some pretty diffrent genes being expresed, so thats why I figured this data would be a good trial run.
So I split up the process into 2 parts. Part 1 is the actual DESeq2 analysis of the data. Here we’re gonna process the data, set up the groups for analysis, normalise, clean, and run DESeq2.
Lets load in the the reqired packages for this analysis. Note that a lot of the packages installed come from Bioconductor, so the installation step may look something like this:
if (!require(“BiocManager”, quietly = TRUE)) install.packages(“BiocManager”)
BiocManager::install(“DESeq2”)
library("DESeq2")
library("dplyr")
library("apeglm")
library("tidyverse")
library("ggplot2")
Now we load in the data:
count_data <- read.csv("C:\\Users\\12142\\OneDrive\\Documents\\Bioinformatics\\Learning\\mouse_mammary_tutorial.csv")
samp_info <-read.csv("C:\\Users\\12142\\OneDrive\\Documents\\Bioinformatics\\Learning\\Samp_mice_info_data.csv", header = T)
This chunk of code creates a df called colData. This is an important peice of meta data that sets up the analysis by differentiating what 2 groups will be tested against each other.
new_df <- samp_info %>%
select(FileName, CellType) %>%
mutate(Group = ifelse(CellType == "luminal", 1, 0))
colData <- new_df %>%
select(-CellType) %>%
column_to_rownames(var = 'FileName')
colData$Group <- factor(colData$Group, levels = c('0', '1'))
DESeq2 only allows integers as an input. Whether or not data has decimals in it depends on method of alignment that was used upstream in the primary sequencing. In addition, the data provided has some existing columns like ‘length’ which will not be needed for the following analysis.
count_data <- count_data %>%
select(., !Length) %>%
column_to_rownames(var = "EntrezGeneID")
count_data[, 1:12] <- sapply(count_data[, 1:12], as.integer)
The following code sets up and actually performs DEseq analysis.
colData$Group <- factor(colData$Group)
dds <- DESeqDataSetFromMatrix(countData = count_data, colData = colData, design = ~Group)
dds <- dds[rowSums(counts(dds)>2) >=4]
dds <- DESeq(dds)
res <- results(dds)
The following code chunk is included to refine the DESeq2 analysis, reducing noise and promoting biologically relevant data.
The lfcShrink function is applied to the dds object, which contains the DEresults. This function applies shrinkage estimation to the log fold changes to reduce noise and improve the stability of the estimates.
resLFC <- lfcShrink(dds, coef = 2)
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
## Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
## sequence count data: removing the noise and preserving large differences.
## Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
The results are ordered by adjusted p-values in ascending order.
(resOrdered <- res[order(res$padj), ])
## log2 fold change (MLE): Group 1 vs 0
## Wald test p-value: Group 1 vs 0
## DataFrame with 17115 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 94352 332.797 -5.01959 0.747887 -6.71169 1.92381e-11 1.61436e-07
## 57435 1173.316 -5.35480 0.796186 -6.72557 1.74908e-11 1.61436e-07
## 52150 1402.498 5.57065 0.906812 6.14311 8.09198e-10 4.11797e-06
## 16956 14326.838 6.26951 1.025704 6.11240 9.81462e-10 4.11797e-06
## 21906 139.876 6.69694 1.123705 5.95970 2.52702e-09 6.05871e-06
## ... ... ... ... ... ... ...
## 14404 21.1495 -0.781582 0.994879 -0.785605 NA NA
## 107528 137.0134 -1.927262 0.796926 -2.418371 NA NA
## 69693 172.1455 0.144645 1.048361 0.137973 NA NA
## 66995 60.0717 -1.792925 1.339495 -1.338508 NA NA
## 210297 101.9928 -2.375874 1.247873 -1.903938 NA NA
write.csv(as.data.frame(resOrdered), file = "MouseMAM_DESeq.csv")
summary(res)
##
## out of 17115 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 3436, 20%
## LFC < 0 (down) : 3268, 19%
## outliers [1] : 332, 1.9%
## low counts [2] : 0, 0%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
MA plots are commonly used to visualize the relationship between the log fold change and the average expression levels of genes. This plot is particularly useful for detecting biases, identifying outliers, and assessing the quality of the DE results.
The MA plot shown has 2 different colored points, grey and blue. These colors signify whether or not the genes had significant p-value. If a gene is not statistically significant in terms of its p-value, it suggests that there is no strong evidence to support a significant difference in expression levels between the two groups.
plotMA(resLFC, ylim = c(-5,5))
Figure 1: Basal vs Luminal Cells MA Plot
pdf(file = 'MA_Plot_BAS_VS_LUM.pdf')
dev.off()
## png
## 2
A PCA (Principal Component Analysis) plot is used to display the relationships and patterns within high-dimensional data. To be honest at the time of writing this I am not too sure what the axis represent, but I know that they have something to to with calculating signifigant patterns of variationin the data. Despite my ignorance, the pca plot actually looks good. Using the eyeball test, we can see some pretty good grouping between within basal and luminal cells (with the exeption of a few outliers.)
dds <- estimateSizeFactors(dds)
se <- SummarizedExperiment(log2(counts(dds, normalize = TRUE) + 1), colData = colData(dds))
pca_plot <- plotPCA(DESeqTransform(se), intgroup = "Group", returnData = TRUE)
pca_plot$Group <- ifelse(pca_plot$Group == 0, "Basal Cells", "Luminal Cells")
ggplot(pca_plot, aes(x = PC1, y = PC2, color = Group)) +
geom_point() +
labs(title = "PCA Plot") +
theme_minimal()
So this part was included on the base tutorial that I completed, and It’s pretty cool. So basically, the row names of the results df that was created earlier are titled by EntrezGeneIDs. These gene IDs are stored in a database, and we can basically pull the pathways that are involved in eacho of them, and display pathways that map where our DE genes lie.
library(AnnotationDbi)
library(org.Mm.eg.db)
library(pathview)
library(gage)
library(gageData)
library(imager)
In this next step, we are going to add to the results df by mapping the Entrez iDs of our data to the org.Mm.eg.db database. For example, when mapping id to gene name, id “47097” maps to the “X-linked Kx blood group related 4” gene. so all of the subsequent tabs are the diffrent annotations I chose to include, note that all possible catagories are included in “Avalible Columns”.
columns(org.Mm.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MGI"
## [16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID"
## [21] "PROSITE" "REFSEQ" "SYMBOL" "UNIPROT"
res$symbol = mapIds(org.Mm.eg.db,
keys = row.names(res),
column = "SYMBOL",
keytype = "ENTREZID",
multivals = 'first')
## 'select()' returned 1:1 mapping between keys and columns
Honsestly this step seems redundant because the row names of the res df is already entrezids, but in order for the next couple of steps to work, we need a column of ids so just work with me here. Im sure there is a better way to do this but honestly im probably the only one who is going to see this.
res$entrez = mapIds(org.Mm.eg.db,
keys = row.names(res),
column = 'ENTREZID',
keytype = "ENTREZID",
multiVals = 'first')
res$ensembl = mapIds(org.Mm.eg.db,
keys = row.names(res),
column = 'ENSEMBL',
keytype = "ENTREZID",
multiVals = 'first')
## 'select()' returned 1:many mapping between keys and columns
res$name = mapIds(org.Mm.eg.db,
keys = row.names(res),
column = 'GENENAME',
keytype = 'ENTREZID',
multiVals = 'first')
## 'select()' returned 1:1 mapping between keys and columns
Here is what the following code chunk does:
Loads the kegg.sets.mm and go.subs.mm data objects. These are datasets containing information related to KEGG pathways and gene ontology terms.
Assigns the log fold change values from the results df to foldchanges. It also assigns the corresponding Entrez gene iDs from the results to the column names of the foldchanges data frame.
Retrieves KEGG pathway information for the mouse species using the kegg.gsets() function from the gage package. The id.type argument specifies that the gene IDs used for mapping are Entrez gene IDs.
Filters the KEGG pathway gene sets to select only the signifigant pathways based on the significance indices (sigmet.idx) obtained from the kegg.mm object. The filtered gene sets are assigned to the variable kegg.mm.sigmet.
foldchanges <- as.data.frame(res$log2FoldChange,
row.names = row.names(res))
data(kegg.sets.mm)
data(go.subs.mm)
foldchanges <- res$log2FoldChange
names(foldchanges) = res$entrez
kegg.mm <- kegg.gsets('mouse', id.type = 'entrez')
kegg.mm.sigmet <- kegg.mm$kg.sets[kegg.mm$sigmet.idx]
keggres <- gage(foldchanges, gsets = kegg.mm.sigmet, same.dir = TRUE)
keggres_df <- lapply(keggres, head)
print(keggres_df)
## $greater
## p.geomean stat.mean
## mmu00190 Oxidative phosphorylation 3.634977e-07 5.153766
## mmu01212 Fatty acid metabolism 3.997815e-05 4.091224
## mmu01200 Carbon metabolism 9.178441e-05 3.806625
## mmu04141 Protein processing in endoplasmic reticulum 1.802763e-04 3.609983
## mmu00630 Glyoxylate and dicarboxylate metabolism 3.457800e-04 3.589821
## mmu01040 Biosynthesis of unsaturated fatty acids 6.371377e-04 3.386699
## p.val q.val
## mmu00190 Oxidative phosphorylation 3.634977e-07 8.723945e-05
## mmu01212 Fatty acid metabolism 3.997815e-05 4.797378e-03
## mmu01200 Carbon metabolism 9.178441e-05 7.342753e-03
## mmu04141 Protein processing in endoplasmic reticulum 1.802763e-04 1.081658e-02
## mmu00630 Glyoxylate and dicarboxylate metabolism 3.457800e-04 1.659744e-02
## mmu01040 Biosynthesis of unsaturated fatty acids 6.371377e-04 2.016511e-02
## set.size exp1
## mmu00190 Oxidative phosphorylation 124 3.634977e-07
## mmu01212 Fatty acid metabolism 60 3.997815e-05
## mmu01200 Carbon metabolism 109 9.178441e-05
## mmu04141 Protein processing in endoplasmic reticulum 165 1.802763e-04
## mmu00630 Glyoxylate and dicarboxylate metabolism 30 3.457800e-04
## mmu01040 Biosynthesis of unsaturated fatty acids 32 6.371377e-04
##
## $less
## p.geomean stat.mean p.val
## mmu04510 Focal adhesion 2.681356e-07 -5.100224 2.681356e-07
## mmu04921 Oxytocin signaling pathway 3.080430e-05 -4.074030 3.080430e-05
## mmu04512 ECM-receptor interaction 7.780766e-05 -3.880085 7.780766e-05
## mmu04020 Calcium signaling pathway 1.485034e-04 -3.653060 1.485034e-04
## mmu04360 Axon guidance 2.048363e-04 -3.569629 2.048363e-04
## mmu04022 cGMP-PKG signaling pathway 2.137789e-04 -3.566106 2.137789e-04
## q.val set.size exp1
## mmu04510 Focal adhesion 6.435255e-05 194 2.681356e-07
## mmu04921 Oxytocin signaling pathway 3.696516e-03 132 3.080430e-05
## mmu04512 ECM-receptor interaction 6.224612e-03 83 7.780766e-05
## mmu04020 Calcium signaling pathway 7.922216e-03 193 1.485034e-04
## mmu04360 Axon guidance 7.922216e-03 169 2.048363e-04
## mmu04022 cGMP-PKG signaling pathway 7.922216e-03 141 2.137789e-04
##
## $stats
## stat.mean exp1
## mmu00190 Oxidative phosphorylation 5.153766 5.153766
## mmu01212 Fatty acid metabolism 4.091224 4.091224
## mmu01200 Carbon metabolism 3.806625 3.806625
## mmu04141 Protein processing in endoplasmic reticulum 3.609983 3.609983
## mmu00630 Glyoxylate and dicarboxylate metabolism 3.589821 3.589821
## mmu01040 Biosynthesis of unsaturated fatty acids 3.386699 3.386699
It is important to note that “greater” reffers to upregulated pathways, while “less” refers to down regulated. Im going to save the pathways below.
UpRegPath <- keggres$greater
DownRegPath <- keggres$less
Here I’m going to save the top 5, it’s a benign number, idealy you would pick based of of pval or whatever, but this is just an example.
keggrespathways <- data.frame(id = rownames(keggres$greater), keggres$greater) %>%
as_tibble() %>%
filter(row_number() <= 5) %>%
.$id %>%
as.character()
keggrespathways
## [1] "mmu00190 Oxidative phosphorylation"
## [2] "mmu01212 Fatty acid metabolism"
## [3] "mmu01200 Carbon metabolism"
## [4] "mmu04141 Protein processing in endoplasmic reticulum"
## [5] "mmu00630 Glyoxylate and dicarboxylate metabolism"
keggresids <- substr(keggrespathways, start = 1, stop = 8)
Note that I left 2 out, its just because for whatever reason, they returt 404 errors, it could be due to a lot of reasons, but leaving them out is just easier. This dosnet make them any less important.
keggresids <- keggresids[!(keggresids %in% c("mmu01200", "mmu01212"))]
plot_pathway <- function(pid) {
tryCatch({
pathview(gene.data = foldchanges, pathway.id = pid, species = 'mouse', new.signature = FALSE)
}, error = function(e) {
cat("An error occurred while plotting pathway", pid, "\n")
})
}
tmp <- sapply(keggresids, function(pid) plot_pathway(pid))
Here I’m going to save the top 5, it’s a benign number, idealy you would pick based of of pval or whatever, but this is just an example.
keggrespathways <- data.frame(id = rownames(keggres$less), keggres$less) %>%
as_tibble() %>%
filter(row_number() <= 5) %>%
.$id %>%
as.character()
keggrespathways
## [1] "mmu04510 Focal adhesion" "mmu04921 Oxytocin signaling pathway"
## [3] "mmu04512 ECM-receptor interaction" "mmu04020 Calcium signaling pathway"
## [5] "mmu04360 Axon guidance"
keggresids <- substr(keggrespathways, start = 1, stop = 8)
keggresids
## [1] "mmu04510" "mmu04921" "mmu04512" "mmu04020" "mmu04360"
All 5 pathways download here so it makes things a bit easier.
plot_pathway <- function(pid) {
tryCatch({
pathview(gene.data = foldchanges, pathway.id = pid, species = 'mouse', new.signature = FALSE)
}, error = function(e) {
cat("An error occurred while plotting pathway", pid, "\n")
})
}
tmp <- sapply(keggresids, function(pid) plot_pathway(pid))
Ok so thats pretty much it for this time, I tried to figure out a way to make the upregulated stuff go on the upregulated tab to make it clean/easy, but I’m tired lol. So im just gonna list the shown pathways below.
Up Regulated Pathways -
Down Regulated Pathways -
filenames <- list.files(path = "C:/Users/12142/OneDrive/Documents/Bioinformatics", pattern = '.*pathview.png')
all_images <- lapply(filenames, load.image)
knitr::include_graphics(filenames)