Code
if ("DESeq2" %in% rownames(installed.packages()) == 'FALSE') BiocManager::install('DESeq2')
if ("kableExtra" %in% rownames(installed.packages()) == 'FALSE') install.packages('kableExtra')This code performs differential expression analysis to identify deferentially expressed genes (DEGs).
First, it reads in a count matrix of isoform counts generated by kallisto and trinity, with row names set to the gene/transcript IDs and the first column removed. It then rounds the counts to whole numbers.
The results() function is used to extract the results table, which is ordered by gene/transcript ID.
The code then prints the top few rows of the results table and calculates the number of DEGs with an adjusted p-value less than or equal to 0.05. It plots the log2 fold changes versus the mean normalized counts for all genes, highlighting significant DEGs in red and adding horizontal lines at 2-fold upregulation and downregulation. Finally, it writes the list of significant DEGs to a file called “DEGlist.tab”.
if ("DESeq2" %in% rownames(installed.packages()) == 'FALSE') BiocManager::install('DESeq2')
if ("kableExtra" %in% rownames(installed.packages()) == 'FALSE') install.packages('kableExtra')library(DESeq2)
library(kableExtra)
library(tidyverse)
library(ggplot2)countmatrix <- read.delim("../output/trinity-matrix/mcap.isoform.counts.matrix", header = TRUE, sep = '\t')
rownames(countmatrix) <- countmatrix$X
countmatrix <- countmatrix[,-1]
head(countmatrix)This count matrix has 54,384 rows (genes) and 8 columns (samples) Each column represents a sample, and each row represents a gene. The ‘count’ is the number of times that gene was found in the RNA sequence of the sample.
countmatrix <- round(countmatrix, 0)
head(countmatrix)Here the code creates a data.frame named deseq2.colData containing information about the experimental conditions (embryos & recruits). It uses the column data dataframe named deseq2.colData to create a DESeqDataSet object using the DESeqDataSetFromMatrix function from the DESeq2 package.
# make a dataframe of 4 embryos and 4 recruits
deseq2.colData <- data.frame(condition=factor(c(rep("embryos", 4), rep("recruits", 4))),
type=factor(rep("single-read", 8)))
# set row names to match the column names in the count matrix
rownames(deseq2.colData) <- colnames(data)
# DESeqDataSet object created using the `DESeqDataSetFromMatrix` function
deseq2.dds <- DESeqDataSetFromMatrix(countData = countmatrix,
colData = deseq2.colData,
design = ~ condition)The DESeqDataSet object, named deseq.dds, is then passed to the DESeq() function to fit a negative binomial model and estimate dispersions.
deseq2.dds <- DESeq(deseq2.dds)
deseq2.res <- results(deseq2.dds)
deseq2.res <- deseq2.res[order(rownames(deseq2.res)), ]
dim(deseq2.res)
summary(deseq2.res)This code chunk subsets the deseq2.res data frame, selecting the rows where both of the following conditions are satisfied: !is.na(deseq2.res$padj) adjusted p-value is not missing
deseq2.res$padj <= 0.05 adjusted p-value is less than or equal to 0.05
In summary, the code selects rows from the deseq2.res data frame where the padj values are not missing and are less than or equal to 0.05. It then calculates the dimensions (number of rows and columns) of this subset, giving us the number of significantly differentially expressed genes between our comparative groups ‘embryos’ and ‘recruits’
deseq2.sig.res <- (deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ])
head(deseq2.sig.res)
# calculate the dimensions of the resulting subset. The dim() function returns a vector with two elements: the number of rows and the number of columns in the subset.
dim(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ])Results: Of the 54,384 rows (genes), 5,736 of them were significantly differentially expressed (p <= 0.05) between M. capitata embryos and recruits.
vsd <- vst(deseq2.dds, blind = FALSE)
pca <- plotPCA(vsd, intgroup = "condition")
pca
ggsave("../figs/pca.png", plot = pca, width = 8, height = 4, dpi = 600)if ("pheatmap" %in% rownames(installed.packages()) == 'FALSE') install.packages('pheatmap') library(pheatmap)# Select top 50 differentially expressed genes
res <- results(deseq2.dds)
res_ordered <- res[order(res$padj), ]
top_genes <- row.names(res_ordered)[1:20]
# Extract counts and normalize
counts <- counts(deseq2.dds, normalized = TRUE)
counts_top <- counts[top_genes, ]
# Log-transform counts
log_counts_top <- log2(counts_top + 1)
# Generate heatmap
heatmap_50 <- pheatmap(log_counts_top, scale = "row")
ggsave("../figs/heatmap_50.png", plot = heatmap_50, width = 8, height = 5, dpi = 600)
heatmap_50# The main plot
fancy_volcano <- plot(deseq2.res$baseMean, deseq2.res$log2FoldChange,
pch=20, cex=0.45, ylim=c(-3, 3), log="x", col="darkgray",
main="DEG Coral Early Life History (pval <= 0.05)",
xlab="mean of normalized counts",
ylab="Log2 Fold Change")
points(deseq2.sig.res$baseMean, deseq2.sig.res$log2FoldChange, pch=20, cex=0.45, col="red")
abline(h=c(-1,1), col="blue")
dev.print(png, file = "../figs/fancy-volcano.png", width = 6, height = 4, units = "in", res = 300)# Prepare the data for plotting
res_df <- as.data.frame(deseq2.res)
res_df$gene <- row.names(res_df)
# Create volcano plot
basic_volcano <- ggplot(res_df, aes(x = log2FoldChange, y = -log10(padj), color = padj < 0.05)) +
geom_point(alpha = 0.6, size = 1.5) +
scale_color_manual(values = c("grey", "red")) +
labs(title = "Volcano Plot",
x = "Log2 Fold Change",
y = "-Log10 Adjusted P-value",
color = "Significantly\nDifferentially Expressed") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "top")
# Save plot
ggsave(path = "../figs", filename="basic-volcano.png")
# Print plot
print(basic_volcano)Write the output to a table
mkdir -p ../output/deseqwrite.table(tmp.sig, "../output/deseq/DEGlist.tab", sep = '\t', row.names = T)Let’s look at this list
deglist <- read.csv("../output/DEGlist.tab", sep = '\t', header = TRUE)
deglist$RowName <- rownames(deglist)
deglist2 <- deglist[, c("RowName", "pvalue")] # Optionally, reorder the columns
head(deglist2)