Intro

Hello there, in this opportunity we will estimate the differential expression of RNAseq transcripts of Drosophila melanogaster genome during its developmental stages like L1, L2, etc using DESeq2 library in Rstudio Bioconductor package. DESeq2 its ideal for RNAseq or ChIPSeq data due to it uses dispersion estimates and relative expression changes to strengthen estimates, unlike edgeR, DESeq2 uses a geometric normalization in which the per lane scaling factor is computed as the median of the ratios of the gene count over its geometric mean ratio.

The dataset will be “modencodefly” dataset. It contains 147 different samples of D. melanogaster with 110Mpb genome, annotated with about 5.000 gene features.(https://github.com/PacktPublishing/R-Bioinformatics-Cookbook/tree/master)

Required Bioconductor installation

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.21")
BiocManager::install("DESeq2")
BiocManager::install("Biobase") #We have to make sure of installing Biobase library


### Also we need the readr package
install.packages("readr")
## package 'readr' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\USUARIO\AppData\Local\Temp\RtmpI3XkEQ\downloaded_packages
install.packages("magrittr")
## package 'magrittr' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\USUARIO\AppData\Local\Temp\RtmpI3XkEQ\downloaded_packages
install.packages("ggplot2")
## package 'ggplot2' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\USUARIO\AppData\Local\Temp\RtmpI3XkEQ\downloaded_packages
install.packages("forcats")
## package 'forcats' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\USUARIO\AppData\Local\Temp\RtmpI3XkEQ\downloaded_packages
# Load DESeq2 package
library(DESeq2)
library(ggplot2)

Analysis for Drosophila melanogaster RNAseq

Uploading Data from a count matrix

We will upload two existing files called modencodelfy_phenodata.txt and modencodefly_count_table.txt from a Github repository (https://github.com/PacktPublishing/R-Bioinformatics-Cookbook/tree/master/datasets).

# Set working directory
setwd("C:/Users/USUARIO/Documents/Bioinformatics2025/Chapter1RNAseq")

# Load count matrix
count_dataframe <- readr::read_tsv(file.path("datasets","modencodefly_count_table.txt"))
genes <- count_dataframe[['gene']]
count_dataframe[['gene']] <- NULL
count_matrix <- as.matrix(count_dataframe)
rownames(count_matrix) <- genes

# Load phenotype table

pheno_data <- readr::read_table2(file.path("datasets","modencodefly_phenodata.txt"))


# Filter only L1 and L2 larvae
experiments_of_interest <- c("L1Larvae", "L2Larvae")
columns_of_interest <- which(pheno_data[['stage']] %in% experiments_of_interest)
pheno_subset <- pheno_data[columns_of_interest, ]
pheno_subset$stage <- forcats::as_factor(pheno_subset$stage)
counts_of_interest <- count_matrix[, columns_of_interest]

Once it is made, we have to perform the differential expression analysis. It cast the question: Is RNAseq data differentially expressed between L1 and L2?.

# Create DESeq2 dataset
dds <- DESeqDataSetFromMatrix(countData = counts_of_interest,
                              colData = pheno_subset,
                              design = ~ stage)

# Run DESeq pipeline (normalization + model fitting)
dds <- DESeq(dds)

# Extract results for L2 vs L1 comparison
res <- results(dds, contrast = c("stage", "L2Larvae", "L1Larvae"))

# Order results by adjusted p-value (FDR)
res <- res[order(res$padj), ]

# Show top 10 genes
head(res, 10)
## log2 fold change (MLE): stage L2Larvae vs L1Larvae 
## Wald test p-value: stage L2Larvae vs L1Larvae 
## DataFrame with 10 rows and 6 columns
##              baseMean log2FoldChange     lfcSE      stat    pvalue      padj
##             <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
## FBgn0000075  1349.855       4.367348 0.0697743   62.5925         0         0
## FBgn0000317  1708.519       1.669397 0.0385489   43.3059         0         0
## FBgn0000448   831.094       2.682721 0.0623063   43.0570         0         0
## FBgn0000451  6624.952       5.339893 0.0638057   83.6900         0         0
## FBgn0000473  1169.223       2.025039 0.0468797   43.1965         0         0
## FBgn0001112  1013.879       3.751534 0.0683098   54.9194         0         0
## FBgn0001149  8666.857       0.965579 0.0222492   43.3984         0         0
## FBgn0001205  2445.617       1.540388 0.0357611   43.0744         0         0
## FBgn0001218 30048.876       1.226959 0.0199273   61.5718         0         0
## FBgn0001254  1395.425       5.696738 0.1012636   56.2565         0         0

Here is the explanation about the results:

Key Results Meaning
FBgn ID FlyBase gene ID
logFC log2 fold change. How much or less expressed the gene is in L2 larvae vs. L1 stage. Positive values mean upregulated in L2
logCPM log2 of counts per million: the overall expression level
F the F-statistic from the F-test
p-value Raw p-value from the model
FDR False Discovery Rate (adjusted p-value). Values < 0.05 are significant

Now we should graph the results, is convenient for this time using a volcano plot:

# Prepare data frame for ggplot
res_df <- as.data.frame(res)
res_df$gene <- rownames(res_df)
res_df$significant <- ifelse(res_df$padj < 0.05, "Significant", "Not Significant")

ggplot(res_df, aes(x = log2FoldChange, y = -log10(padj), color = significant)) +
  geom_point(alpha = 0.6, size = 2) +
  scale_color_manual(values = c("grey", "red")) +
  theme_minimal() +
  labs(title = "Volcano Plot (L2 vs L1)",
       x = "Log2 Fold Change",
       y = "-log10 Adjusted p-value") +
  theme(legend.title = element_blank())

For purposes of practicality, a volcano plot is a type of scatter-plot like used to visualize the results of DGE (Differential Gene Expressions).We have:

  • X-axis(logFC):
    • Positive: genes upregulated in L2 Larvae
    • Nevative: genes downregulated in L2 Larvae
  • Y-axis (-log10 FDR):
    • Higher values mean more statistically significant genes
    • The higher the point is, the lower its p-value (more significant)

In summary, logFC > 0 = UPREGULATED IN L2 logFC < 0 = downregulated in L2

# Select top 30 most significant genes
if (!require("pheatmap")) install.packages("pheatmap")
library(pheatmap)
top_genes <- rownames(res_df)[1:30]

# Transform counts for visualization (variance stabilizing transformation)
vsd <- vst(dds, blind = FALSE)
vsd_mat <- assay(vsd)
vsd_top <- vsd_mat[top_genes, ]

# Annotate columns with stage information
annotation_col <- data.frame(Stage = pheno_subset$stage)
rownames(annotation_col) <- colnames(vsd_top)

# Generate heatmap
pheatmap(vsd_top,
         annotation_col = annotation_col,
         show_rownames = TRUE,
         fontsize_row = 8,
         scale = "row",
         clustering_distance_rows = "correlation",
         clustering_distance_cols = "correlation",
         main = "Heatmap of Top 30 Differentially Expressed Genes")