Hello there, in this opportunity we will estimate the differential expression of RNAseq transcripts of Drosophila melanogaster genome during its developmental stages like L1, L2, etc using DESeq2 library in Rstudio Bioconductor package. DESeq2 its ideal for RNAseq or ChIPSeq data due to it uses dispersion estimates and relative expression changes to strengthen estimates, unlike edgeR, DESeq2 uses a geometric normalization in which the per lane scaling factor is computed as the median of the ratios of the gene count over its geometric mean ratio.
The dataset will be “modencodefly” dataset. It contains 147 different samples of D. melanogaster with 110Mpb genome, annotated with about 5.000 gene features.(https://github.com/PacktPublishing/R-Bioinformatics-Cookbook/tree/master)
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(version = "3.21")
BiocManager::install("DESeq2")
BiocManager::install("Biobase") #We have to make sure of installing Biobase library
### Also we need the readr package
install.packages("readr")
## package 'readr' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\USUARIO\AppData\Local\Temp\RtmpI3XkEQ\downloaded_packages
install.packages("magrittr")
## package 'magrittr' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\USUARIO\AppData\Local\Temp\RtmpI3XkEQ\downloaded_packages
install.packages("ggplot2")
## package 'ggplot2' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\USUARIO\AppData\Local\Temp\RtmpI3XkEQ\downloaded_packages
install.packages("forcats")
## package 'forcats' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\USUARIO\AppData\Local\Temp\RtmpI3XkEQ\downloaded_packages
# Load DESeq2 package
library(DESeq2)
library(ggplot2)
We will upload two existing files called modencodelfy_phenodata.txt and modencodefly_count_table.txt from a Github repository (https://github.com/PacktPublishing/R-Bioinformatics-Cookbook/tree/master/datasets).
# Set working directory
setwd("C:/Users/USUARIO/Documents/Bioinformatics2025/Chapter1RNAseq")
# Load count matrix
count_dataframe <- readr::read_tsv(file.path("datasets","modencodefly_count_table.txt"))
genes <- count_dataframe[['gene']]
count_dataframe[['gene']] <- NULL
count_matrix <- as.matrix(count_dataframe)
rownames(count_matrix) <- genes
# Load phenotype table
pheno_data <- readr::read_table2(file.path("datasets","modencodefly_phenodata.txt"))
# Filter only L1 and L2 larvae
experiments_of_interest <- c("L1Larvae", "L2Larvae")
columns_of_interest <- which(pheno_data[['stage']] %in% experiments_of_interest)
pheno_subset <- pheno_data[columns_of_interest, ]
pheno_subset$stage <- forcats::as_factor(pheno_subset$stage)
counts_of_interest <- count_matrix[, columns_of_interest]
Once it is made, we have to perform the differential expression analysis. It cast the question: Is RNAseq data differentially expressed between L1 and L2?.
# Create DESeq2 dataset
dds <- DESeqDataSetFromMatrix(countData = counts_of_interest,
colData = pheno_subset,
design = ~ stage)
# Run DESeq pipeline (normalization + model fitting)
dds <- DESeq(dds)
# Extract results for L2 vs L1 comparison
res <- results(dds, contrast = c("stage", "L2Larvae", "L1Larvae"))
# Order results by adjusted p-value (FDR)
res <- res[order(res$padj), ]
# Show top 10 genes
head(res, 10)
## log2 fold change (MLE): stage L2Larvae vs L1Larvae
## Wald test p-value: stage L2Larvae vs L1Larvae
## DataFrame with 10 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## FBgn0000075 1349.855 4.367348 0.0697743 62.5925 0 0
## FBgn0000317 1708.519 1.669397 0.0385489 43.3059 0 0
## FBgn0000448 831.094 2.682721 0.0623063 43.0570 0 0
## FBgn0000451 6624.952 5.339893 0.0638057 83.6900 0 0
## FBgn0000473 1169.223 2.025039 0.0468797 43.1965 0 0
## FBgn0001112 1013.879 3.751534 0.0683098 54.9194 0 0
## FBgn0001149 8666.857 0.965579 0.0222492 43.3984 0 0
## FBgn0001205 2445.617 1.540388 0.0357611 43.0744 0 0
## FBgn0001218 30048.876 1.226959 0.0199273 61.5718 0 0
## FBgn0001254 1395.425 5.696738 0.1012636 56.2565 0 0
Here is the explanation about the results:
Key Results | Meaning |
---|---|
FBgn ID | FlyBase gene ID |
logFC | log2 fold change. How much or less expressed the gene is in L2 larvae vs. L1 stage. Positive values mean upregulated in L2 |
logCPM | log2 of counts per million: the overall expression level |
F | the F-statistic from the F-test |
p-value | Raw p-value from the model |
FDR | False Discovery Rate (adjusted p-value). Values < 0.05 are significant |
Now we should graph the results, is convenient for this time using a volcano plot:
# Prepare data frame for ggplot
res_df <- as.data.frame(res)
res_df$gene <- rownames(res_df)
res_df$significant <- ifelse(res_df$padj < 0.05, "Significant", "Not Significant")
ggplot(res_df, aes(x = log2FoldChange, y = -log10(padj), color = significant)) +
geom_point(alpha = 0.6, size = 2) +
scale_color_manual(values = c("grey", "red")) +
theme_minimal() +
labs(title = "Volcano Plot (L2 vs L1)",
x = "Log2 Fold Change",
y = "-log10 Adjusted p-value") +
theme(legend.title = element_blank())
For purposes of practicality, a volcano plot is a type of scatter-plot like used to visualize the results of DGE (Differential Gene Expressions).We have:
In summary, logFC > 0 = UPREGULATED IN L2 logFC < 0 = downregulated in L2
# Select top 30 most significant genes
if (!require("pheatmap")) install.packages("pheatmap")
library(pheatmap)
top_genes <- rownames(res_df)[1:30]
# Transform counts for visualization (variance stabilizing transformation)
vsd <- vst(dds, blind = FALSE)
vsd_mat <- assay(vsd)
vsd_top <- vsd_mat[top_genes, ]
# Annotate columns with stage information
annotation_col <- data.frame(Stage = pheno_subset$stage)
rownames(annotation_col) <- colnames(vsd_top)
# Generate heatmap
pheatmap(vsd_top,
annotation_col = annotation_col,
show_rownames = TRUE,
fontsize_row = 8,
scale = "row",
clustering_distance_rows = "correlation",
clustering_distance_cols = "correlation",
main = "Heatmap of Top 30 Differentially Expressed Genes")