Introduction

This is a little mini summer project that I put together to highlight some skills I’ve learned as a beginning student of bioinformatics. In the past, I’ve used limma for very basic DE analysis on miRNA, but this is the first time that I am using Rstudio/markdown to perform DE analysis using DESeq2. I want to give thanks to Joshua Vandenbrink; he runs a youtube channel that contains a lot of R/computational biology tutorials, and I would not have been able to perform this analysis without his guidance. Sorry if the whole thing is a little rough around the edges, but this is more of a trial run than anything.

The data used in this project is mouse RNA-seq reads from Fu et al. 2015. Both the raw data and counts can be downloaded from thd GEO database under accession number GSE60450.

In the study where the data was pulled from, the expression patterns of basal and luminal cells in the mammary gland were investigated across different stages of mouse reproductive cycle. The experiment included six groups, and two biological replicates were utilized, involving independent sorting of cells from mammary glands of virgin, pregnant, and lactating mice.

For the purpouse of my analysis, I dont really care about replicating anyones results or doing anything too ground breaking. In theory, Basal and Luminal cells should have some pretty diffrent genes being expresed, so thats why I figured this data would be a good trial run.

Differental Expression pt.1 - The Nitty Grity

So I split up the process into 2 parts. Part 1 is the actual DESeq2 analysis of the data. Here we’re gonna process the data, set up the groups for analysis, normalise, clean, and run DESeq2.

Setup

Packages

Lets load in the the reqired packages for this analysis. Note that a lot of the packages installed come from Bioconductor, so the installation step may look something like this:

if (!require(“BiocManager”, quietly = TRUE)) install.packages(“BiocManager”)

BiocManager::install(“DESeq2”)

library("DESeq2")
library("dplyr")
library("apeglm")
library("tidyverse")
library("ggplot2")

Data

Now we load in the data:

count_data <- read.csv("C:\\Users\\12142\\OneDrive\\Documents\\Bioinformatics\\Learning\\mouse_mammary_tutorial.csv")

samp_info <-read.csv("C:\\Users\\12142\\OneDrive\\Documents\\Bioinformatics\\Learning\\Samp_mice_info_data.csv", header = T)

Cleaning

This chunk of code creates a df called colData. This is an important peice of meta data that sets up the analysis by differentiating what 2 groups will be tested against each other.

“0” represents basal cells
“1” represents luminal cells

new_df <- samp_info %>%
  select(FileName, CellType) %>%
  mutate(Group = ifelse(CellType == "luminal", 1, 0))

colData <- new_df %>%
  select(-CellType) %>%
  column_to_rownames(var = 'FileName')

colData$Group <- factor(colData$Group, levels = c('0', '1'))

DESeq2 only allows integers as an input. Whether or not data has decimals in it depends on method of alignment that was used upstream in the primary sequencing. In addition, the data provided has some existing columns like ‘length’ which will not be needed for the following analysis.

count_data <- count_data %>%
  select(., !Length) %>%
  column_to_rownames(var = "EntrezGeneID")

count_data[, 1:12] <- sapply(count_data[, 1:12], as.integer)

DESeq2

The following code sets up and actually performs DEseq analysis.

The first step converts the group data (bas vs lum) into a factor.
Next, a DESeqDataSet object is created using the “DESeqDataSetFromMatrix” function. This takes the count datamand the sample information as inputs, and a design formula (~Group) is specified to define the groups of interest.
After creating the DESeqDataSet, filtering is performed using the rowSums function to exclude genes with low expression or low variation. Rows are retained only if the sum of counts across all samples is greater than 2 and if at least 4 samples have counts greater than 2.
DESeq is applied
The results are saved in ‘res’

colData$Group <- factor(colData$Group)

dds <- DESeqDataSetFromMatrix(countData = count_data, colData = colData, design = ~Group)

dds <- dds[rowSums(counts(dds)>2) >=4]

dds <- DESeq(dds)

res <- results(dds)

Statistical Results

The following code chunk is included to refine the DESeq2 analysis, reducing noise and promoting biologically relevant data.

lfcShrink

The lfcShrink function is applied to the dds object, which contains the DEresults. This function applies shrinkage estimation to the log fold changes to reduce noise and improve the stability of the estimates.

resLFC <- lfcShrink(dds, coef = 2)

## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895

Ordering

The results are ordered by adjusted p-values in ascending order.

(resOrdered <- res[order(res$padj), ])

## log2 fold change (MLE): Group 1 vs 0 
## Wald test p-value: Group 1 vs 0 
## DataFrame with 17115 rows and 6 columns
##         baseMean log2FoldChange     lfcSE      stat      pvalue        padj
##        <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
## 94352    332.797       -5.01959  0.747887  -6.71169 1.92381e-11 1.61436e-07
## 57435   1173.316       -5.35480  0.796186  -6.72557 1.74908e-11 1.61436e-07
## 52150   1402.498        5.57065  0.906812   6.14311 8.09198e-10 4.11797e-06
## 16956  14326.838        6.26951  1.025704   6.11240 9.81462e-10 4.11797e-06
## 21906    139.876        6.69694  1.123705   5.95970 2.52702e-09 6.05871e-06
## ...          ...            ...       ...       ...         ...         ...
## 14404    21.1495      -0.781582  0.994879 -0.785605          NA          NA
## 107528  137.0134      -1.927262  0.796926 -2.418371          NA          NA
## 69693   172.1455       0.144645  1.048361  0.137973          NA          NA
## 66995    60.0717      -1.792925  1.339495 -1.338508          NA          NA
## 210297  101.9928      -2.375874  1.247873 -1.903938          NA          NA

write.csv(as.data.frame(resOrdered), file = "MouseMAM_DESeq.csv")

Summary

summary(res)

## 
## out of 17115 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 3436, 20%
## LFC < 0 (down)     : 3268, 19%
## outliers [1]       : 332, 1.9%
## low counts [2]     : 0, 0%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Plots

MA plot

MA plots are commonly used to visualize the relationship between the log fold change and the average expression levels of genes. This plot is particularly useful for detecting biases, identifying outliers, and assessing the quality of the DE results.

The MA plot shown has 2 different colored points, grey and blue. These colors signify whether or not the genes had significant p-value. If a gene is not statistically significant in terms of its p-value, it suggests that there is no strong evidence to support a significant difference in expression levels between the two groups.

plotMA(resLFC, ylim = c(-5,5))

Figure 1: Basal vs Luminal Cells MA Plot

pdf(file = 'MA_Plot_BAS_VS_LUM.pdf')

dev.off()

## png 
##   2

PCA plot

A PCA (Principal Component Analysis) plot is used to display the relationships and patterns within high-dimensional data. To be honest at the time of writing this I am not too sure what the axis represent, but I know that they have something to to with calculating signifigant patterns of variationin the data. Despite my ignorance, the pca plot actually looks good. Using the eyeball test, we can see some pretty good grouping between within basal and luminal cells (with the exeption of a few outliers.)

dds <- estimateSizeFactors(dds)

se <- SummarizedExperiment(log2(counts(dds, normalize = TRUE) + 1), colData = colData(dds))

pca_plot <- plotPCA(DESeqTransform(se), intgroup = "Group", returnData = TRUE)

pca_plot$Group <- ifelse(pca_plot$Group == 0, "Basal Cells", "Luminal Cells")

ggplot(pca_plot, aes(x = PC1, y = PC2, color = Group)) +
  geom_point() +
  labs(title = "PCA Plot") +
  theme_minimal()

Differential Expression Pt.2 : Pretty Pictures

So this part was included on the base tutorial that I completed, and It’s pretty cool. So basically, the row names of the results df that was created earlier are titled by EntrezGeneIDs. These gene IDs are stored in a database, and we can basically pull the pathways that are involved in eacho of them, and display pathways that map where our DE genes lie.

Packages

library(AnnotationDbi)
library(org.Mm.eg.db)
library(pathview)
library(gage)
library(gageData)
library(imager)

Annotation Setup

In this next step, we are going to add to the results df by mapping the Entrez iDs of our data to the org.Mm.eg.db database. For example, when mapping id to gene name, id “47097” maps to the “X-linked Kx blood group related 4” gene. so all of the subsequent tabs are the diffrent annotations I chose to include, note that all possible catagories are included in “Avalible Columns”.

Avalible Columns

columns(org.Mm.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MGI"         
## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UNIPROT"

Gene Symbol

res$symbol = mapIds(org.Mm.eg.db,
                    keys = row.names(res),
                    column = "SYMBOL",
                    keytype = "ENTREZID",
                    multivals = 'first')

## 'select()' returned 1:1 mapping between keys and columns

Entrez ID

Honsestly this step seems redundant because the row names of the res df is already entrezids, but in order for the next couple of steps to work, we need a column of ids so just work with me here. Im sure there is a better way to do this but honestly im probably the only one who is going to see this.

res$entrez = mapIds(org.Mm.eg.db,
                    keys = row.names(res),
                    column = 'ENTREZID',
                    keytype = "ENTREZID",
                    multiVals = 'first')

Ensembl ID

res$ensembl = mapIds(org.Mm.eg.db,
                    keys = row.names(res),
                    column = 'ENSEMBL',
                    keytype = "ENTREZID",
                    multiVals = 'first')

## 'select()' returned 1:many mapping between keys and columns

Gene Name

res$name = mapIds(org.Mm.eg.db,
                    keys = row.names(res),
                    column = 'GENENAME',
                    keytype = 'ENTREZID',
                    multiVals = 'first')

## 'select()' returned 1:1 mapping between keys and columns

Finding Relevant Pathways

Here is what the following code chunk does:

Loads the kegg.sets.mm and go.subs.mm data objects. These are datasets containing information related to KEGG pathways and gene ontology terms.
Assigns the log fold change values from the results df to foldchanges. It also assigns the corresponding Entrez gene iDs from the results to the column names of the foldchanges data frame.
Retrieves KEGG pathway information for the mouse species using the kegg.gsets() function from the gage package. The id.type argument specifies that the gene IDs used for mapping are Entrez gene IDs.
Filters the KEGG pathway gene sets to select only the signifigant pathways based on the significance indices (sigmet.idx) obtained from the kegg.mm object. The filtered gene sets are assigned to the variable kegg.mm.sigmet.

foldchanges <- as.data.frame(res$log2FoldChange,
                             row.names = row.names(res))
data(kegg.sets.mm)
data(go.subs.mm)

foldchanges <- res$log2FoldChange
names(foldchanges) = res$entrez

kegg.mm <- kegg.gsets('mouse', id.type = 'entrez')
kegg.mm.sigmet <- kegg.mm$kg.sets[kegg.mm$sigmet.idx]

keggres <- gage(foldchanges, gsets = kegg.mm.sigmet, same.dir = TRUE)

keggres_df <- lapply(keggres, head)
print(keggres_df)

## $greater
##                                                         p.geomean stat.mean
## mmu00190 Oxidative phosphorylation                   3.634977e-07  5.153766
## mmu01212 Fatty acid metabolism                       3.997815e-05  4.091224
## mmu01200 Carbon metabolism                           9.178441e-05  3.806625
## mmu04141 Protein processing in endoplasmic reticulum 1.802763e-04  3.609983
## mmu00630 Glyoxylate and dicarboxylate metabolism     3.457800e-04  3.589821
## mmu01040 Biosynthesis of unsaturated fatty acids     6.371377e-04  3.386699
##                                                             p.val        q.val
## mmu00190 Oxidative phosphorylation                   3.634977e-07 8.723945e-05
## mmu01212 Fatty acid metabolism                       3.997815e-05 4.797378e-03
## mmu01200 Carbon metabolism                           9.178441e-05 7.342753e-03
## mmu04141 Protein processing in endoplasmic reticulum 1.802763e-04 1.081658e-02
## mmu00630 Glyoxylate and dicarboxylate metabolism     3.457800e-04 1.659744e-02
## mmu01040 Biosynthesis of unsaturated fatty acids     6.371377e-04 2.016511e-02
##                                                      set.size         exp1
## mmu00190 Oxidative phosphorylation                        124 3.634977e-07
## mmu01212 Fatty acid metabolism                             60 3.997815e-05
## mmu01200 Carbon metabolism                                109 9.178441e-05
## mmu04141 Protein processing in endoplasmic reticulum      165 1.802763e-04
## mmu00630 Glyoxylate and dicarboxylate metabolism           30 3.457800e-04
## mmu01040 Biosynthesis of unsaturated fatty acids           32 6.371377e-04
## 
## $less
##                                        p.geomean stat.mean        p.val
## mmu04510 Focal adhesion             2.681356e-07 -5.100224 2.681356e-07
## mmu04921 Oxytocin signaling pathway 3.080430e-05 -4.074030 3.080430e-05
## mmu04512 ECM-receptor interaction   7.780766e-05 -3.880085 7.780766e-05
## mmu04020 Calcium signaling pathway  1.485034e-04 -3.653060 1.485034e-04
## mmu04360 Axon guidance              2.048363e-04 -3.569629 2.048363e-04
## mmu04022 cGMP-PKG signaling pathway 2.137789e-04 -3.566106 2.137789e-04
##                                            q.val set.size         exp1
## mmu04510 Focal adhesion             6.435255e-05      194 2.681356e-07
## mmu04921 Oxytocin signaling pathway 3.696516e-03      132 3.080430e-05
## mmu04512 ECM-receptor interaction   6.224612e-03       83 7.780766e-05
## mmu04020 Calcium signaling pathway  7.922216e-03      193 1.485034e-04
## mmu04360 Axon guidance              7.922216e-03      169 2.048363e-04
## mmu04022 cGMP-PKG signaling pathway 7.922216e-03      141 2.137789e-04
## 
## $stats
##                                                      stat.mean     exp1
## mmu00190 Oxidative phosphorylation                    5.153766 5.153766
## mmu01212 Fatty acid metabolism                        4.091224 4.091224
## mmu01200 Carbon metabolism                            3.806625 3.806625
## mmu04141 Protein processing in endoplasmic reticulum  3.609983 3.609983
## mmu00630 Glyoxylate and dicarboxylate metabolism      3.589821 3.589821
## mmu01040 Biosynthesis of unsaturated fatty acids      3.386699 3.386699

It is important to note that “greater” reffers to upregulated pathways, while “less” refers to down regulated. Im going to save the pathways below.

UpRegPath <- keggres$greater
DownRegPath <- keggres$less

Up-Regulated Pathways

Pathways

Here I’m going to save the top 5, it’s a benign number, idealy you would pick based of of pval or whatever, but this is just an example.

keggrespathways <- data.frame(id = rownames(keggres$greater), keggres$greater) %>%
  as_tibble() %>%
  filter(row_number() <= 5) %>%
  .$id %>%
  as.character()
keggrespathways

## [1] "mmu00190 Oxidative phosphorylation"                  
## [2] "mmu01212 Fatty acid metabolism"                      
## [3] "mmu01200 Carbon metabolism"                          
## [4] "mmu04141 Protein processing in endoplasmic reticulum"
## [5] "mmu00630 Glyoxylate and dicarboxylate metabolism"

keggresids <- substr(keggrespathways, start = 1, stop = 8)

Downloading and saving to file

Note that I left 2 out, its just because for whatever reason, they returt 404 errors, it could be due to a lot of reasons, but leaving them out is just easier. This dosnet make them any less important.

keggresids <- keggresids[!(keggresids %in% c("mmu01200", "mmu01212"))]

plot_pathway <- function(pid) {
  tryCatch({
    pathview(gene.data = foldchanges, pathway.id = pid, species = 'mouse', new.signature = FALSE)
  }, error = function(e) {
    cat("An error occurred while plotting pathway", pid, "\n")
  })
}

tmp <- sapply(keggresids, function(pid) plot_pathway(pid))

Down-Regulated Pathways

Pathways

Here I’m going to save the top 5, it’s a benign number, idealy you would pick based of of pval or whatever, but this is just an example.

keggrespathways <- data.frame(id = rownames(keggres$less), keggres$less) %>%
  as_tibble() %>%
  filter(row_number() <= 5) %>%
  .$id %>%
  as.character()
keggrespathways

## [1] "mmu04510 Focal adhesion"             "mmu04921 Oxytocin signaling pathway"
## [3] "mmu04512 ECM-receptor interaction"   "mmu04020 Calcium signaling pathway" 
## [5] "mmu04360 Axon guidance"

keggresids <- substr(keggrespathways, start = 1, stop = 8)
keggresids

## [1] "mmu04510" "mmu04921" "mmu04512" "mmu04020" "mmu04360"

Downloading and saving to file

All 5 pathways download here so it makes things a bit easier.

plot_pathway <- function(pid) {
  tryCatch({
    pathview(gene.data = foldchanges, pathway.id = pid, species = 'mouse', new.signature = FALSE)
  }, error = function(e) {
    cat("An error occurred while plotting pathway", pid, "\n")
  })
}

tmp <- sapply(keggresids, function(pid) plot_pathway(pid))

Grande Finale

Ok so thats pretty much it for this time, I tried to figure out a way to make the upregulated stuff go on the upregulated tab to make it clean/easy, but I’m tired lol. So im just gonna list the shown pathways below.

Up Regulated Pathways -

Oxidative phosphorylation
Protein processing in endoplasmic reticulum
Glyoxylate and dicarboxylate metabolism

Down Regulated Pathways -

Focal adhesion
Oxytocin signaling pathway
ECM-receptor interaction
Calcium signaling pathway
Axon guidance

filenames <- list.files(path = "C:/Users/12142/OneDrive/Documents/Bioinformatics", pattern = '.*pathview.png')


all_images <- lapply(filenames, load.image)

knitr::include_graphics(filenames)

DESeq2 Personal Tutorial

Benji Lamp

2023-06-12

Introduction

Differental Expression pt.1 - The Nitty Grity

Setup

Packages

Data

Cleaning

DESeq2

Statistical Results

lfcShrink

Ordering

Summary

Plots

MA plot

PCA plot

Differential Expression Pt.2 : Pretty Pictures

Packages

Annotation Setup

Avalible Columns

Gene Symbol

Entrez ID

Ensembl ID

Gene Name

Finding Relevant Pathways

Up-Regulated Pathways

Pathways

Downloading and saving to file

Down-Regulated Pathways

Pathways

Downloading and saving to file

Grande Finale