03-differently-expressed-genes-DESeq2

Author

Sarah Tanja

Published

May 16, 2023

Running DESeq2

This code performs differential expression analysis to identify deferentially expressed genes (DEGs).

First, it reads in a count matrix of isoform counts generated by kallisto and trinity, with row names set to the gene/transcript IDs and the first column removed. It then rounds the counts to whole numbers.

The results() function is used to extract the results table, which is ordered by gene/transcript ID.

The code then prints the top few rows of the results table and calculates the number of DEGs with an adjusted p-value less than or equal to 0.05. It plots the log2 fold changes versus the mean normalized counts for all genes, highlighting significant DEGs in red and adding horizontal lines at 2-fold upregulation and downregulation. Finally, it writes the list of significant DEGs to a file called “DEGlist.tab”.

Install Packages

Code

if ("DESeq2" %in% rownames(installed.packages()) == 'FALSE') BiocManager::install('DESeq2')
if ("kableExtra" %in% rownames(installed.packages()) == 'FALSE') install.packages('kableExtra')

Load Libraries

Code

library(DESeq2)
library(kableExtra)
library(tidyverse)
library(ggplot2)

Read in count matrix

Code

countmatrix <- read.delim("../output/trinity-matrix/mcap.isoform.counts.matrix", header = TRUE, sep = '\t')
rownames(countmatrix) <- countmatrix$X
countmatrix <- countmatrix[,-1]
head(countmatrix)

This count matrix has 54,384 rows (genes) and 8 columns (samples) Each column represents a sample, and each row represents a gene. The ‘count’ is the number of times that gene was found in the RNA sequence of the sample.

Round integers up to whole numbers for analysis

Code

countmatrix <- round(countmatrix, 0)
head(countmatrix)

Create Dataframe

Here the code creates a data.frame named deseq2.colData containing information about the experimental conditions (embryos & recruits). It uses the column data dataframe named deseq2.colData to create a DESeqDataSet object using the DESeqDataSetFromMatrix function from the DESeq2 package.

Code

# make a dataframe of 4 embryos and 4 recruits
deseq2.colData <- data.frame(condition=factor(c(rep("embryos", 4), rep("recruits", 4))), 
                             type=factor(rep("single-read", 8)))

# set row names to match the column names in the count matrix
rownames(deseq2.colData) <- colnames(data)

# DESeqDataSet object created using the `DESeqDataSetFromMatrix` function
deseq2.dds <- DESeqDataSetFromMatrix(countData = countmatrix,
                                     colData = deseq2.colData, 
                                     design = ~ condition)

Negative Binomial

The DESeqDataSet object, named deseq.dds, is then passed to the DESeq() function to fit a negative binomial model and estimate dispersions.

Code

deseq2.dds <- DESeq(deseq2.dds)
deseq2.res <- results(deseq2.dds)
deseq2.res <- deseq2.res[order(rownames(deseq2.res)), ]
dim(deseq2.res)
summary(deseq2.res)

This code chunk subsets the deseq2.res data frame, selecting the rows where both of the following conditions are satisfied: !is.na(deseq2.res$padj) adjusted p-value is not missing
deseq2.res$padj <= 0.05 adjusted p-value is less than or equal to 0.05

In summary, the code selects rows from the deseq2.res data frame where the padj values are not missing and are less than or equal to 0.05. It then calculates the dimensions (number of rows and columns) of this subset, giving us the number of significantly differentially expressed genes between our comparative groups ‘embryos’ and ‘recruits’

Code

deseq2.sig.res <- (deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ])

head(deseq2.sig.res)

# calculate the dimensions of the resulting subset. The dim() function returns a vector with two elements: the number of rows and the number of columns in the subset.
dim(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ])

Important

Results: Of the 54,384 rows (genes), 5,736 of them were significantly differentially expressed (p <= 0.05) between M. capitata embryos and recruits.

Plotting

PCA

Code

vsd <- vst(deseq2.dds, blind = FALSE)
pca <- plotPCA(vsd, intgroup = "condition")
pca

ggsave("../figs/pca.png", plot = pca, width = 8, height = 4, dpi = 600)

Heatmap

Install Packages

Code

if ("pheatmap" %in% rownames(installed.packages()) == 'FALSE') install.packages('pheatmap')

Load Libraries

Code

library(pheatmap)

Top 50 Differentially Expressed Genes

Code

# Select top 50 differentially expressed genes
res <- results(deseq2.dds)
res_ordered <- res[order(res$padj), ]
top_genes <- row.names(res_ordered)[1:20]

# Extract counts and normalize
counts <- counts(deseq2.dds, normalized = TRUE)
counts_top <- counts[top_genes, ]

# Log-transform counts
log_counts_top <- log2(counts_top + 1)

# Generate heatmap
heatmap_50 <- pheatmap(log_counts_top, scale = "row")

ggsave("../figs/heatmap_50.png", plot = heatmap_50, width = 8, height = 5, dpi = 600)

heatmap_50

Fancy Volcano

Code

# The main plot
fancy_volcano <- plot(deseq2.res$baseMean, deseq2.res$log2FoldChange, 
                          pch=20, cex=0.45, ylim=c(-3, 3), log="x", col="darkgray",
                          main="DEG Coral Early Life History  (pval <= 0.05)",
                          xlab="mean of normalized counts",
                          ylab="Log2 Fold Change") 
                  points(deseq2.sig.res$baseMean, deseq2.sig.res$log2FoldChange, pch=20, cex=0.45, col="red") 
                  abline(h=c(-1,1), col="blue")

dev.print(png, file = "../figs/fancy-volcano.png", width = 6, height = 4, units = "in", res = 300)

Basic Volcano

Code

# Prepare the data for plotting
res_df <- as.data.frame(deseq2.res)
res_df$gene <- row.names(res_df)

# Create volcano plot
basic_volcano <- ggplot(res_df, aes(x = log2FoldChange, y = -log10(padj), color = padj < 0.05)) +
                 geom_point(alpha = 0.6, size = 1.5) +
                 scale_color_manual(values = c("grey", "red")) +
                 labs(title = "Volcano Plot",
                        x = "Log2 Fold Change",
                        y = "-Log10 Adjusted P-value",
                        color = "Significantly\nDifferentially Expressed") +
                 theme_minimal() +
                 theme(panel.grid.major = element_blank(),
                       panel.grid.minor = element_blank(),
                       legend.position = "top")

# Save plot
ggsave(path = "../figs", filename="basic-volcano.png")

# Print plot
print(basic_volcano)

Save the list of Differentially Expressed Genes (DEGs)!

Write the output to a table

Code

mkdir -p ../output/deseq

Code

write.table(tmp.sig, "../output/deseq/DEGlist.tab", sep = '\t', row.names = T)

Let’s look at this list

Code

deglist <- read.csv("../output/DEGlist.tab", sep = '\t', header = TRUE)
deglist$RowName <- rownames(deglist)
deglist2 <- deglist[, c("RowName", "pvalue")] # Optionally, reorder the columns
head(deglist2)

Summary & Next Steps

Info

Then we take the list of differentially expressed genes located at ‘../output/deseq/DEGlist.tab’ and run it through the 04-functional-enrichment-GOSeqcode!

--- title: "03-differently-expressed-genes-DESeq2" author: "Sarah Tanja" date: "`r format(Sys.time(), '%d %B, %Y')`" format: html: df-print: paged toc: true smooth-scroll: true link-external-icon: true link-external-newwindow: true code-fold: show code-tools: true code-copy: true highlight-style: breeze code-overflow: wrap theme: light: flatly dark: darkly editor: visual --- ```{r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, # Display code chunks eval = FALSE, # Evaluate code chunks warning = FALSE, # Hide warnings message = FALSE, # Hide messages fig.width = 6, # Set plot width in inches fig.height = 4, # Set plot height in inches fig.align = "center" # Align plots to the center ) ``` # Running DESeq2 This code performs differential expression analysis to identify deferentially expressed genes (DEGs). First, it reads in a count matrix of isoform counts generated by `kallisto` and `trinity`, with row names set to the gene/transcript IDs and the first column removed. It then rounds the counts to whole numbers. The `results()` function is used to extract the results table, which is ordered by gene/transcript ID. The code then prints the top few rows of the results table and calculates the number of DEGs with an adjusted p-value less than or equal to 0.05. It plots the log2 fold changes versus the mean normalized counts for all genes, highlighting significant DEGs in red and adding horizontal lines at 2-fold upregulation and downregulation. Finally, it writes the list of significant DEGs to a file called "DEGlist.tab". ## Install Packages ```{r install-packages} if ("DESeq2" %in% rownames(installed.packages()) == 'FALSE') BiocManager::install('DESeq2') if ("kableExtra" %in% rownames(installed.packages()) == 'FALSE') install.packages('kableExtra') ``` ### Load Libraries ```{r load libraries} library(DESeq2) library(kableExtra) library(tidyverse) library(ggplot2) ``` ### Read in count matrix ```{r} countmatrix <- read.delim("../output/trinity-matrix/mcap.isoform.counts.matrix", header = TRUE, sep = '\t') rownames(countmatrix) <- countmatrix$X countmatrix <- countmatrix[,-1] head(countmatrix) ``` This count matrix has 54,384 rows (genes) and 8 columns (samples) Each column represents a sample, and each row represents a gene. The 'count' is the number of times that gene was found in the RNA sequence of the sample. ### Round integers up to whole numbers for analysis ```{r} countmatrix <- round(countmatrix, 0) head(countmatrix) ``` ### Create Dataframe Here the code creates a `data.frame` named `deseq2.colData` containing information about the experimental conditions (embryos & recruits). It uses the column data dataframe named `deseq2.colData` to create a `DESeqDataSet` object using the `DESeqDataSetFromMatrix` function from the DESeq2 package. ```{r} # make a dataframe of 4 embryos and 4 recruits deseq2.colData <- data.frame(condition=factor(c(rep("embryos", 4), rep("recruits", 4))), type=factor(rep("single-read", 8))) # set row names to match the column names in the count matrix rownames(deseq2.colData) <- colnames(data) # DESeqDataSet object created using the `DESeqDataSetFromMatrix` function deseq2.dds <- DESeqDataSetFromMatrix(countData = countmatrix, colData = deseq2.colData, design = ~ condition) ``` ### Negative Binomial The `DESeqDataSet` object, named `deseq.dds`, is then passed to the `DESeq()` function to fit a negative binomial model and estimate dispersions. ```{r} deseq2.dds <- DESeq(deseq2.dds) deseq2.res <- results(deseq2.dds) deseq2.res <- deseq2.res[order(rownames(deseq2.res)), ] dim(deseq2.res) summary(deseq2.res) ``` This code chunk subsets the deseq2.res data frame, selecting the rows where both of the following conditions are satisfied: `!is.na(deseq2.res$padj)` adjusted p-value is not missing\ `deseq2.res$padj <= 0.05` adjusted p-value is less than or equal to 0.05 In summary, the code selects rows from the deseq2.res data frame where the padj values are not missing and are less than or equal to 0.05. It then calculates the dimensions (number of rows and columns) of this subset, giving us the number of significantly differentially expressed genes between our comparative groups 'embryos' and 'recruits' ```{r} deseq2.sig.res <- (deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ]) head(deseq2.sig.res) # calculate the dimensions of the resulting subset. The dim() function returns a vector with two elements: the number of rows and the number of columns in the subset. dim(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ]) ``` ::: callout-important *Results:* Of the 54,384 rows (genes), 5,736 of them were significantly differentially expressed (p \<= 0.05) between *M. capitata* embryos and recruits. ::: # Plotting ## PCA ```{r} vsd <- vst(deseq2.dds, blind = FALSE) pca <- plotPCA(vsd, intgroup = "condition") pca ggsave("../figs/pca.png", plot = pca, width = 8, height = 4, dpi = 600) ``` ## Heatmap ### Install Packages ```{r} if ("pheatmap" %in% rownames(installed.packages()) == 'FALSE') install.packages('pheatmap') ``` ### Load Libraries ```{r} library(pheatmap) ``` ### Top 50 Differentially Expressed Genes ```{r} # Select top 50 differentially expressed genes res <- results(deseq2.dds) res_ordered <- res[order(res$padj), ] top_genes <- row.names(res_ordered)[1:20] # Extract counts and normalize counts <- counts(deseq2.dds, normalized = TRUE) counts_top <- counts[top_genes, ] # Log-transform counts log_counts_top <- log2(counts_top + 1) # Generate heatmap heatmap_50 <- pheatmap(log_counts_top, scale = "row") ggsave("../figs/heatmap_50.png", plot = heatmap_50, width = 8, height = 5, dpi = 600) heatmap_50 ``` ## Fancy Volcano ```{r} ``` ```{r} # The main plot fancy_volcano <- plot(deseq2.res$baseMean, deseq2.res$log2FoldChange, pch=20, cex=0.45, ylim=c(-3, 3), log="x", col="darkgray", main="DEG Coral Early Life History (pval <= 0.05)", xlab="mean of normalized counts", ylab="Log2 Fold Change") points(deseq2.sig.res$baseMean, deseq2.sig.res$log2FoldChange, pch=20, cex=0.45, col="red") abline(h=c(-1,1), col="blue") dev.print(png, file = "../figs/fancy-volcano.png", width = 6, height = 4, units = "in", res = 300) ``` ## Basic Volcano ```{r} # Prepare the data for plotting res_df <- as.data.frame(deseq2.res) res_df$gene <- row.names(res_df) # Create volcano plot basic_volcano <- ggplot(res_df, aes(x = log2FoldChange, y = -log10(padj), color = padj < 0.05)) + geom_point(alpha = 0.6, size = 1.5) + scale_color_manual(values = c("grey", "red")) + labs(title = "Volcano Plot", x = "Log2 Fold Change", y = "-Log10 Adjusted P-value", color = "Significantly\nDifferentially Expressed") + theme_minimal() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), legend.position = "top") # Save plot ggsave(path = "../figs", filename="basic-volcano.png") # Print plot print(basic_volcano) ``` # Save the list of Differentially Expressed Genes (DEGs)! Write the output to a table ```{r, engine='bash'} mkdir -p ../output/deseq ``` ```{r} write.table(tmp.sig, "../output/deseq/DEGlist.tab", sep = '\t', row.names = T) ``` Let's look at this list ```{r} deglist <- read.csv("../output/DEGlist.tab", sep = '\t', header = TRUE) deglist$RowName <- rownames(deglist) deglist2 <- deglist[, c("RowName", "pvalue")] # Optionally, reorder the columns head(deglist2) ``` # Summary & Next Steps ::: callout-info Then we take the list of differentially expressed genes located at '../output/deseq/DEGlist.tab' and run it through the `04-functional-enrichment-GOSeq`code! :::