¨Drosophila melanogaster RNAseq data Analysis using edgeR library

background

Hello there, in this opportunity we will estimate the differential expression of RNAseq transcripts of Drosophila melanogaster genome during its developmental stages like L1, L2, etc using edgeR library in Rstudio Bioconductor package. edgeR is commonly used for count sparse RNAseq data which are multi-group experiments. It uses a normalization called TMM which is the weighted mean of log ratio between sample and control. Then, the TMM value must be close to one or shall be used as a correction factor to be applied to overall library sizes.

First of all, the first step (upstream step) requires us to take high-throughput sequence reads, align them to a reference genome and produce files describing them. Having made this, we will be able to work. So we will have two common input objects:

Count table from the text file
Our prepared dataset from the NHGRI.

The dataset will be “modencodefly” dataset. It contains 147 different samples of D. melanogaster with 110Mpb genome, annotated with about 5.000 gene features.(https://github.com/PacktPublishing/R-Bioinformatics-Cookbook/tree/master)

Required Bioconductor installation

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.21")
BiocManager::install("edgeR")

### Also we need the readr package
install.packages("readr")

## package 'readr' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\USUARIO\AppData\Local\Temp\RtmpgnsP7m\downloaded_packages

install.packages("magrittr")

## package 'magrittr' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\USUARIO\AppData\Local\Temp\RtmpgnsP7m\downloaded_packages

install.packages("edgeR")
install.packages("forcats")

## package 'forcats' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\USUARIO\AppData\Local\Temp\RtmpgnsP7m\downloaded_packages

Analysis for Drosophila melanogaster RNAseq

Uploading Data

We will upload two existing files called datasets/ch1/modencodelfy_phenodata.txt and datasets/ch1/modencodefly_count_table.txt from a Github repository (https://github.com/PacktPublishing/R-Bioinformatics-Cookbook/tree/master/datasets). These are data from a count matrix and an ExpressionSet.

# Establish working directory, in my case, my computer
setwd("C:/Users/USUARIO/Documents/Bioinformatics2025/Chapter1RNAseq")

# Read the count table with readr library
count_dataframe <- readr::read_tsv(file.path("datasets", "modencodefly_count_table.txt"))

# Extracting genes table and name it
genes <- count_dataframe[['gene']]
count_dataframe[['gene']] <- NULL
count_matrix <- as.matrix(count_dataframe)
rownames(count_matrix) <- genes

# Read the phenotype data
pheno_data <- readr::read_table2(file.path("datasets", "modencodefly_phenodata.txt"))

Now we have to specify the experiments of interest, in this case is L1 Larvae stage and L2 Larvae stage:

experiments_of_interest = c("L1Larvae","L2Larvae")###indices of the samples
columns_of_interest = which(pheno_data[["stage"]] %in%
experiments_of_interest)###filters which samples belong to the stages L1 Larvae and L2 Larvae and stores their positions

Then we specify the grouping factor

library(magrittr)
grouping = pheno_data[["stage"]][columns_of_interest] %>%
forcats::as_factor()

Moreover, we form the subset of count data and create the Differencial Gene Expression Object:

#Subsetting the count matrix to include only the columns of interest
counts_of_interest = count_matrix[,columns_of_interest] #full matrix of raw gene counts (rows = genes, columns = samples)
library(edgeR)
count_dge = edgeR::DGEList(counts  = counts_of_interest,group = grouping) # create a dgeList that has gene expressions and groups like L1 and L2

Once it is made, we have to perform the differential expression analysis. It cast the question: Is RNAseq data differentially expressed between L1 and L2?.

eset_dge <- edgeR::DGEList(counts = counts_of_interest,group = grouping)
design = model.matrix(~ grouping)

eset_dge = edgeR::estimateDisp(eset_dge,design)
fit = edgeR::glmQLFit(eset_dge,design)
results = edgeR::glmQLFTest(fit,coef=2)
topTags(results)

## Coefficient:  groupingL2Larvae 
##                logFC    logCPM        F       PValue          FDR
## FBgn0027527 6.318477 11.148756 43306.39 8.921836e-36 1.326588e-31
## FBgn0037430 6.557811  9.109132 37053.47 4.458944e-35 2.269743e-31
## FBgn0037424 6.417995  9.715826 36957.31 4.579479e-35 2.269743e-31
## FBgn0037414 6.336991 10.704514 32230.76 1.878703e-34 6.983608e-31
## FBgn0029807 6.334533  9.008720 30679.42 3.125484e-34 9.294565e-31
## FBgn0037429 6.623754  8.525136 26529.63 1.399656e-33 3.468581e-30
## FBgn0037224 7.056029  9.195077 25539.20 2.072124e-33 4.064613e-30
## FBgn0030340 6.176240  8.500866 25406.05 2.186892e-33 4.064613e-30
## FBgn0029716 5.167029  8.977840 24890.80 2.700890e-33 4.462171e-30
## FBgn0243586 6.966860  7.769756 24146.95 3.698699e-33 5.499595e-30

Here is the explanation about the results:

Key Results	Meaning
FBgn ID	FlyBase gene ID
logFC	log2 fold change. How much or less expressed the gene is in L2 larvae vs. L1 stage. Positive values mean upregulated in L2
logCPM	log2 of counts per million: the overall expression level
F	the F-statistic from the F-test
p-value	Raw p-value from the model
FDR	False Discovery Rate (adjusted p-value). Values < 0.05 are significant

Visualization

Now we should graph the results, is convenient for this time using a volcano plot:

install.packages("ggplot2")  # We have to install it

## package 'ggplot2' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\USUARIO\AppData\Local\Temp\RtmpgnsP7m\downloaded_packages

library(ggplot2)

results = edgeR::glmQLFTest(fit, coef = 2)
res_table = edgeR::topTags(results, n = Inf)$table
res_table$significant <- res_table$FDR < 0.05

ggplot(res_table, aes(x = logFC, y = -log10(FDR), color = significant)) + 
  geom_point(alpha = 0.6) +
  scale_color_manual(values = c("grey", "red")) +
  theme_minimal() +
  labs(title = "Volcano plot",
       x = "Log Fold Change",
       y = "-log10(FDR)") +
  theme(legend.title = element_blank())

For purposes of practicality, a volcano plot is a type of scatter-plot like used to visualize the results of DGE (Differential Gene Expressions).We have:

X-axis(logFC):
- Positive: genes upregulated in L2 Larvae
- Nevative: genes downregulated in L2 Larvae
Y-axis (-log10 FDR):
- Higher values mean more statistically significant genes
- The higher the point is, the lower its p-value (more significant)

In summary, logFC > 0 = UPREGULATED IN L2 logFC < 0 = downregulated in L2

Now, we are able to visualize it with heatmaps

### HEATMAP OF TOP DIFFERENTIALLY EXPRESSED GENES

# Install and load pheatmap if not installed
if (!require("pheatmap")) install.packages("pheatmap")
library(pheatmap)

# 1. Select significant genes (FDR < 0.05)
sig_res <- res_table[res_table$FDR < 0.05, ]

# 2. Sort by FDR (smallest first) so we take the most significant genes
sig_res <- sig_res[order(sig_res$FDR), ]

# 3. Select the top 30 most significant genes (adjust number if needed)
top_genes <- rownames(sig_res)[1:min(30, nrow(sig_res))]

# 4. Extract counts only for these genes
sig_counts <- counts_of_interest[top_genes, ]

# 5. Log-transform counts to stabilize variance
log_counts <- log2(sig_counts + 1)

# 6. Build annotation for experimental conditions (L1 vs L2)
sample_annotation <- data.frame(Stage = grouping)
rownames(sample_annotation) <- colnames(log_counts)

# 7. Create a heatmap
pheatmap(log_counts,
         scale = "row",  # Normalize expression per gene (row)
         annotation_col = sample_annotation, # Adds L1/L2 labels at top
         clustering_distance_rows = "euclidean",
         clustering_distance_cols = "euclidean",
         clustering_method = "complete",
         show_rownames = TRUE,
         show_colnames = TRUE,
         main = "Top 30 Differentially Expressed Genes (L1 vs L2)",
         fontsize_row = 6,    # Adjust font size for better readability
         fontsize_col = 8,
         border_color = NA)

Now: * the rows are equal to individual genes (top 30 significant according to the FDR) * Columns = individual samples (labeled with SRX IDs)