03-differently-expressed-genes-DESeq2

Author

Sarah Tanja

Published

May 16, 2023

Running DESeq2

This code performs differential expression analysis to identify deferentially expressed genes (DEGs).

First, it reads in a count matrix of isoform counts generated by kallisto and trinity, with row names set to the gene/transcript IDs and the first column removed. It then rounds the counts to whole numbers.

The results() function is used to extract the results table, which is ordered by gene/transcript ID.

The code then prints the top few rows of the results table and calculates the number of DEGs with an adjusted p-value less than or equal to 0.05. It plots the log2 fold changes versus the mean normalized counts for all genes, highlighting significant DEGs in red and adding horizontal lines at 2-fold upregulation and downregulation. Finally, it writes the list of significant DEGs to a file called “DEGlist.tab”.

Install Packages

Code
if ("DESeq2" %in% rownames(installed.packages()) == 'FALSE') BiocManager::install('DESeq2')
if ("kableExtra" %in% rownames(installed.packages()) == 'FALSE') install.packages('kableExtra')

Load Libraries

Code
library(DESeq2)
library(kableExtra)
library(tidyverse)
library(ggplot2)

Read in count matrix

Code
countmatrix <- read.delim("../output/trinity-matrix/mcap.isoform.counts.matrix", header = TRUE, sep = '\t')
rownames(countmatrix) <- countmatrix$X
countmatrix <- countmatrix[,-1]
head(countmatrix)

This count matrix has 54,384 rows (genes) and 8 columns (samples) Each column represents a sample, and each row represents a gene. The ‘count’ is the number of times that gene was found in the RNA sequence of the sample.

Round integers up to whole numbers for analysis

Code
countmatrix <- round(countmatrix, 0)
head(countmatrix)

Create Dataframe

Here the code creates a data.frame named deseq2.colData containing information about the experimental conditions (embryos & recruits). It uses the column data dataframe named deseq2.colData to create a DESeqDataSet object using the DESeqDataSetFromMatrix function from the DESeq2 package.

Code
# make a dataframe of 4 embryos and 4 recruits
deseq2.colData <- data.frame(condition=factor(c(rep("embryos", 4), rep("recruits", 4))), 
                             type=factor(rep("single-read", 8)))

# set row names to match the column names in the count matrix
rownames(deseq2.colData) <- colnames(data)

# DESeqDataSet object created using the `DESeqDataSetFromMatrix` function
deseq2.dds <- DESeqDataSetFromMatrix(countData = countmatrix,
                                     colData = deseq2.colData, 
                                     design = ~ condition)

Negative Binomial

The DESeqDataSet object, named deseq.dds, is then passed to the DESeq() function to fit a negative binomial model and estimate dispersions.

Code
deseq2.dds <- DESeq(deseq2.dds)
deseq2.res <- results(deseq2.dds)
deseq2.res <- deseq2.res[order(rownames(deseq2.res)), ]
dim(deseq2.res)
summary(deseq2.res)

This code chunk subsets the deseq2.res data frame, selecting the rows where both of the following conditions are satisfied: !is.na(deseq2.res$padj) adjusted p-value is not missing
deseq2.res$padj <= 0.05 adjusted p-value is less than or equal to 0.05

In summary, the code selects rows from the deseq2.res data frame where the padj values are not missing and are less than or equal to 0.05. It then calculates the dimensions (number of rows and columns) of this subset, giving us the number of significantly differentially expressed genes between our comparative groups ‘embryos’ and ‘recruits’

Code
deseq2.sig.res <- (deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ])

head(deseq2.sig.res)

# calculate the dimensions of the resulting subset. The dim() function returns a vector with two elements: the number of rows and the number of columns in the subset.
dim(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ])
Important

Results: Of the 54,384 rows (genes), 5,736 of them were significantly differentially expressed (p <= 0.05) between M. capitata embryos and recruits.

Plotting

PCA

Code
vsd <- vst(deseq2.dds, blind = FALSE)
pca <- plotPCA(vsd, intgroup = "condition")
pca

ggsave("../figs/pca.png", plot = pca, width = 8, height = 4, dpi = 600)

Heatmap

Install Packages

Code
if ("pheatmap" %in% rownames(installed.packages()) == 'FALSE') install.packages('pheatmap') 

Load Libraries

Code
library(pheatmap)

Top 50 Differentially Expressed Genes

Code
# Select top 50 differentially expressed genes
res <- results(deseq2.dds)
res_ordered <- res[order(res$padj), ]
top_genes <- row.names(res_ordered)[1:20]

# Extract counts and normalize
counts <- counts(deseq2.dds, normalized = TRUE)
counts_top <- counts[top_genes, ]

# Log-transform counts
log_counts_top <- log2(counts_top + 1)

# Generate heatmap
heatmap_50 <- pheatmap(log_counts_top, scale = "row")

ggsave("../figs/heatmap_50.png", plot = heatmap_50, width = 8, height = 5, dpi = 600)

heatmap_50

Fancy Volcano

Code
# The main plot
fancy_volcano <- plot(deseq2.res$baseMean, deseq2.res$log2FoldChange, 
                          pch=20, cex=0.45, ylim=c(-3, 3), log="x", col="darkgray",
                          main="DEG Coral Early Life History  (pval <= 0.05)",
                          xlab="mean of normalized counts",
                          ylab="Log2 Fold Change") 
                  points(deseq2.sig.res$baseMean, deseq2.sig.res$log2FoldChange, pch=20, cex=0.45, col="red") 
                  abline(h=c(-1,1), col="blue")

dev.print(png, file = "../figs/fancy-volcano.png", width = 6, height = 4, units = "in", res = 300)

Basic Volcano

Code
# Prepare the data for plotting
res_df <- as.data.frame(deseq2.res)
res_df$gene <- row.names(res_df)

# Create volcano plot
basic_volcano <- ggplot(res_df, aes(x = log2FoldChange, y = -log10(padj), color = padj < 0.05)) +
                 geom_point(alpha = 0.6, size = 1.5) +
                 scale_color_manual(values = c("grey", "red")) +
                 labs(title = "Volcano Plot",
                        x = "Log2 Fold Change",
                        y = "-Log10 Adjusted P-value",
                        color = "Significantly\nDifferentially Expressed") +
                 theme_minimal() +
                 theme(panel.grid.major = element_blank(),
                       panel.grid.minor = element_blank(),
                       legend.position = "top")

# Save plot
ggsave(path = "../figs", filename="basic-volcano.png")

# Print plot
print(basic_volcano)

Save the list of Differentially Expressed Genes (DEGs)!

Write the output to a table

Code
mkdir -p ../output/deseq
Code
write.table(tmp.sig, "../output/deseq/DEGlist.tab", sep = '\t', row.names = T)

Let’s look at this list

Code
deglist <- read.csv("../output/DEGlist.tab", sep = '\t', header = TRUE)
deglist$RowName <- rownames(deglist)
deglist2 <- deglist[, c("RowName", "pvalue")] # Optionally, reorder the columns
head(deglist2)

Summary & Next Steps

Info

Then we take the list of differentially expressed genes located at ‘../output/deseq/DEGlist.tab’ and run it through the 04-functional-enrichment-GOSeqcode!