This protocol outlines the steps for performing differential gene expression (DGE) analysis using three popular R packages: DESeq2, edgeR, and limma-voom. The goal is to identify differentially expressed genes (DEGs) across different conditions and to find common genes that are consistently identified by all three methods. This approach ensures robustness and reliability in the identification of key genes involved in the biological process under study.
Differential Gene Expression (DGE) analysis is a crucial step in transcriptomic studies to identify genes that exhibit significant changes in expression across different conditions. The analysis begins with count data obtained from RNA-Seq experiments, typically processed using tools like FeatureCounts or HTSeq. Several statistical approaches, including DESeq2, edgeR, and limma-voom, are commonly used for DGE analysis. These methods normalize raw counts, account for variability, and apply statistical models to identify significantly differentially expressed genes (DEGs).
In DESeq2, raw count data are first normalized, and lowly expressed genes are filtered out. The method estimates size factors and dispersions before fitting a negative binomial model to assess differential expression using the Wald test. Genes with an adjusted p-value of less than 0.05 and a log2 fold-change greater than 1 are considered significant DEGs. Similarly, edgeR normalizes data using the Trimmed Mean of M-values (TMM) method and applies a generalized linear model (GLM) with an empirical Bayes approach to estimate dispersion and identify DEGs. Limma-voom, on the other hand, transforms count data to log2 scale using the voom method before applying linear models and empirical Bayes statistics to detect differentially expressed genes.
After obtaining DEGs from multiple methods, identifying common genes across these approaches increases result reliability. Intersecting the DEG lists from DESeq2, edgeR, and limma-voom helps in identifying robust candidates for further validation. Additionally, functional enrichment analysis, such as Gene Ontology (GO) and KEGG pathway analysis using clusterProfiler, can provide insights into the biological roles of these genes. By integrating multiple DGE methods and enrichment analysis, researchers can improve the confidence and interpretability of their findings in transcriptomic studies.
RNA-Seq Data: GSE284267 count data
Reference Genome & Annotation Files: Necessary for read alignment and gene annotation.
Computational Tools:
Hardware Requirements:
High-performance computing system or cloud-based environment for large datasets.
Sufficient memory (RAM) and storage for processing sequencing data.
Statistical Knowledge:
Programming Skills:
The dataset GSE284267 examines the role of autophagy in small cell lung cancer (SCLC) heterogeneity and plasticity. Researchers conducted RNA sequencing on the H841 cell line, both untreated and treated with 200 mM trehalose for 12 hours to modulate autophagic flux. The study included three biological replicates for each condition, totaling six samples: three controls and three trehalose-treated samples. This design allowed for a comparative analysis of gene expression changes associated with autophagy modulation in SCLC cells.
# Load necessary libraries
library(clusterProfiler)
library(org.Hs.eg.db)
library(ggplot2)
library(enrichplot)
library(AnnotationDbi)
library(DESeq2)
library(edgeR)
library(limma)
library(VennDiagram)
library(ggVennDiagram)
library(ggrepel)
library(enrichplot)
# Load the count data
counts <- read.csv("cancer_countdata.csv", row.names = 1, header = TRUE)
# Replace NA values with zero (recommended for RNA-seq data)
counts[is.na(counts)] <- 0
# Display the first few rows of the count data
head(counts, 5)
## H841_Con1 H841_Con2 H841_Con3 H841_Tre_1 H841_Tre_2 H841_Tre_3
## ENSG00000000003 6828 6861 7456 5753 4573 4803
## ENSG00000000005 0 20 0 0 0 0
## ENSG00000000419 3803 3909 4434 4586 3785 3613
## ENSG00000000457 802 604 794 1058 739 859
## ENSG00000000460 3632 3673 3618 1433 907 1001
# Define the conditions (e.g., Control vs. Treated)
condition <- factor(c(rep("C", 3), rep("T", 3))) # C = Control, T = Treated
colData <- data.frame(condition, row.names = colnames(counts))
# Create DESeqDataSet
dds <- DESeqDataSetFromMatrix(countData = counts, colData = colData, design = ~ condition)
## converting counts to integer mode
# Pre-filtering: Remove rows with low gene counts
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
# Set the reference level for the condition factor
dds$condition <- relevel(dds$condition, ref = "C")
# Run DESeq2 analysis
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
# Extract results for the comparison of interest
result_deseq2 <- results(dds, contrast = c('condition', 'C', 'T'), alpha = 0.05)
# Convert results to a data frame and add gene symbols
result_deseq2 <- as.data.frame(result_deseq2)
result_deseq2$gene <- rownames(result_deseq2)
# Identify significant DEGs
DESeq2_DEGup <- result_deseq2[result_deseq2$padj < 0.05 & result_deseq2$log2FoldChange > 1, ]
DESeq2_DEGdown <- result_deseq2[result_deseq2$padj < 0.05 & result_deseq2$log2FoldChange < -1, ]
DEGs_deseq2 <- rbind(DESeq2_DEGup, DESeq2_DEGdown)
print("Number of Up-regulated genes")
## [1] "Number of Up-regulated genes"
dim(DESeq2_DEGup)
## [1] 3147 7
# Step 1: Create DGEList object
dge <- DGEList(counts = counts, group = condition)
# Step 2: Filter lowly expressed genes
keep <- rowSums(cpm(dge) > 1) >= 2
dge <- dge[keep, , keep.lib.sizes = FALSE]
# Step 3: Normalize library sizes using TMM normalization
dge <- calcNormFactors(dge)
# Step 4: Create design matrix
design <- model.matrix(~ condition)
# Step 5: Estimate dispersions
dge <- estimateDisp(dge, design)
# Step 6: Fit the model and perform differential expression analysis
fit <- glmFit(dge, design)
# Step 7: Perform likelihood ratio test (LRT)
lrt <- glmLRT(fit, coef = 2) # Compare Treated (T) vs. Control (C)
# Extract top differentially expressed genes
result_edgeR <- topTags(lrt, n = Inf)$table
# Add gene names as a column
result_edgeR$gene <- rownames(result_edgeR)
# Identify significant DEGs (e.g., FDR < 0.05 and |logFC| > 1)
edgeR_DEGup <- result_edgeR[result_edgeR$FDR < 0.05 & result_edgeR$logFC > 1, ]
edgeR_DEGdown <- result_edgeR[result_edgeR$FDR < 0.05 & result_edgeR$logFC < -1, ]
DEGs_edgeR <- rbind(edgeR_DEGup, edgeR_DEGdown)
# Save results to a CSV file
write.csv(DEGs_edgeR, "DEGs_edgeR.csv", row.names = FALSE)
# Create DGEList object
dge <- DGEList(counts = counts, group = condition)
# Filter lowly expressed genes (keep genes with CPM > 1 in at least 2 samples)
keep <- rowSums(cpm(dge) > 1) >= 2
dge <- dge[keep, , keep.lib.sizes = FALSE]
# Normalize library sizes using TMM normalization
dge <- calcNormFactors(dge)
# Design matrix for the comparison
design <- model.matrix(~ condition)
# Apply Voom transformation
voom_data <- voom(dge, design, plot = TRUE)
# Fit a linear model and apply empirical Bayes smoothing
fit <- lmFit(voom_data, design)
fit <- eBayes(fit)
# Extract results
result_limma <- topTable(fit, coef = 2, number = Inf, adjust = "fdr")
result_limma$gene <- rownames(result_limma)
# Identify significant DEGs
limma_DEGup <- result_limma[result_limma$adj.P.Val < 0.05 & result_limma$logFC > 1, ]
limma_DEGdown <- result_limma[result_limma$adj.P.Val < 0.05 & result_limma$logFC < -1, ]
DEGs_limma <- rbind(limma_DEGup, limma_DEGdown)
# Extract gene lists from each method
genes_deseq2 <- rownames(DEGs_deseq2)
genes_edgeR <- rownames(DEGs_edgeR)
genes_limma <- rownames(DEGs_limma)
# Find common genes across all three methods
common_genes <- intersect(intersect(genes_deseq2, genes_edgeR), genes_limma)
# Display the number of common genes
head(common_genes,5)
## [1] "ENSG00000000460" "ENSG00000001631" "ENSG00000002016" "ENSG00000003096"
## [5] "ENSG00000003147"
# Create a list of gene sets
gene_lists <- list(
DESeq2 = genes_deseq2,
edgeR = genes_edgeR,
limma = genes_limma
)
venn_plot <- ggVennDiagram(gene_lists, label_alpha = 0) +
scale_fill_gradient(low = "lightgreen", high = "purple") +
theme(legend.position = "none") +
labs(title = "Venn Diagram of DEGs") +
theme_minimal() +
theme(plot.margin = margin(1, 1, 1, 1, "cm")) # Expands plot margins
print(venn_plot)
# Save common genes to a CSV file
write.csv(common_genes, "common_genes.csv", row.names = FALSE)
# Volcano Plot for DESeq2
volcano_deseq2 <- ggplot(result_deseq2, aes(x = log2FoldChange, y = -log10(padj))) +
geom_point(aes(color = ifelse(padj < 0.05 & log2FoldChange > 1, "Upregulated",
ifelse(padj < 0.05 & log2FoldChange < -1, "Downregulated", "Not Significant"))),
alpha = 0.6) +
scale_color_manual(values = c("Upregulated" = "red", "Downregulated" = "green", "Not Significant" = "black")) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
labs(x = "log2 Fold Change", y = "-log10(adjusted p-value)", title = "Volcano Plot - DESeq2",
color = "Gene Regulation") +
theme_minimal() +
geom_text_repel(
data = rbind(
subset(result_deseq2, padj < 0.05 & log2FoldChange > 1)[1:5, ], # Top 5 upregulated genes
subset(result_deseq2, padj < 0.05 & log2FoldChange < -1)[1:5, ] # Top 5 downregulated genes
),
aes(label = gene), # Replace 'gene_name' with the column containing gene names
size = 3,
box.padding = 0.5,
point.padding = 0.3,
segment.color = 'grey50'
)
# Print volcano plot
print(volcano_deseq2)
## Warning: Removed 1154 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: ggrepel: 9 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
##############################################
# Volcano Plot for edgeR
volcano_edgeR <- ggplot(result_edgeR, aes(x = logFC, y = -log10(FDR))) +
geom_point(aes(color = ifelse(FDR < 0.05 & logFC > 1, "Upregulated",
ifelse(FDR < 0.05 & logFC < -1, "Downregulated", "Not Significant"))),
alpha = 0.6) +
scale_color_manual(values = c("Upregulated" = "red", "Downregulated" = "green", "Not Significant" = "black")) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
labs(x = "log2 Fold Change", y = "-log10(FDR)", title = "Volcano Plot - edgeR",
color = "Gene Regulation") +
theme_minimal() +
geom_text_repel(
data = rbind(
subset(result_edgeR, FDR < 0.05 & logFC > 1)[1:5, ], # Top 5 upregulated genes
subset(result_edgeR, FDR < 0.05 & logFC < -1)[1:5, ] # Top 5 downregulated genes
),
aes(label = gene), # Replace 'gene_name' with the column containing gene names
size = 3,
box.padding = 0.5,
point.padding = 0.3,
segment.color = 'grey50'
)
# Print volcano plot
print(volcano_edgeR)
########################################
# Volcano Plot for limma-voom
volcano_limma <- ggplot(result_limma, aes(x = logFC, y = -log10(adj.P.Val))) +
geom_point(aes(color = ifelse(adj.P.Val < 0.05 & logFC > 1, "Upregulated",
ifelse(adj.P.Val < 0.05 & logFC < -1, "Downregulated", "Not Significant"))),
alpha = 0.6) +
scale_color_manual(values = c("Upregulated" = "red", "Downregulated" = "green", "Not Significant" = "black")) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
labs(x = "log2 Fold Change", y = "-log10(adjusted p-value)", title = "Volcano Plot - limma-voom",
color = "Gene Regulation") +
theme_minimal() +
geom_text_repel(
data = rbind(
subset(result_limma, adj.P.Val < 0.05 & logFC > 1)[1:5, ], # Top 5 upregulated genes
subset(result_limma, adj.P.Val < 0.05 & logFC < -1)[1:5, ] # Top 5 downregulated genes
),
aes(label = gene), # Replace 'gene_name' with the column containing gene names
size = 3,
box.padding = 0.5,
point.padding = 0.3,
segment.color = 'grey50'
)
# Print volcano plot
print(volcano_limma)
## Warning: ggrepel: 9 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
# Install if not already installed
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("clusterProfiler", "org.Hs.eg.db", "AnnotationDbi"))
# Convert Ensembl IDs to Gene Symbols
gene_symbols <- mapIds(org.Hs.eg.db,
keys = common_genes,
column = "SYMBOL",
keytype = "ENSEMBL",
multiVals = "first")
## 'select()' returned 1:many mapping between keys and columns
# Remove NA values (unmatched Ensembl IDs)
gene_symbols <- na.omit(gene_symbols)
# GO enrichment for Biological Process
go_bp <- enrichGO(gene = gene_symbols,
OrgDb = org.Hs.eg.db,
keyType = "SYMBOL",
ont = "BP", # Options: "BP", "MF", "CC"
pAdjustMethod = "fdr",
pvalueCutoff = 0.05,
readable = TRUE)
# GO enrichment for Molecular Function
go_mf <- enrichGO(gene = gene_symbols,
OrgDb = org.Hs.eg.db,
keyType = "SYMBOL",
ont = "MF",
pAdjustMethod = "fdr",
pvalueCutoff = 0.05,
readable = TRUE)
# GO enrichment for Cellular Component
go_cc <- enrichGO(gene = gene_symbols,
OrgDb = org.Hs.eg.db,
keyType = "SYMBOL",
ont = "CC",
pAdjustMethod = "fdr",
pvalueCutoff = 0.05,
readable = TRUE)
# Dot plot for BP terms
dotplot(go_bp, showCategory = 10, title = "GO Biological Process (BP)")
# Dot plot for MF terms
dotplot(go_mf, showCategory = 10, title = "GO Molecular Function (MF)")
# Dot plot for MF terms
dotplot(go_cc, showCategory = 10, title = "GO Molecular Function (CC)")
# Plot Biological Process enrichment
barplot(go_bp, showCategory=10, title="Top 10 GO Biological Processes")
# Plot Molecular Function enrichment
barplot(go_mf, showCategory=10, title="Top 10 GO Molecular Functions")
# Plot Cellular Component enrichment
barplot(go_cc, showCategory=10, title="Top 10 GO Cellular Components")
# Compute pairwise term similarity
go_bp_sim <- pairwise_termsim(go_bp)
# Create the enrichment map plot with some customizations
emapplot(go_bp_sim,
label_fontsize = 12, # Adjust the label font size
node_label_size = 5, # Node label size (you can adjust as needed)
colorEdge = TRUE, # Optional: Color the edges based on similarity
title = "GO Enrichment Map") +
scale_color_gradient(low = "blue", high = "red") # Customize color gradient
# Run KEGG enrichment analysis
gene_symbols = as.data.frame(gene_symbols)
# Convert gene symbols to Entrez IDs
gene_list <- bitr(gene_symbols$gene_symbols, fromType="SYMBOL", toType="ENTREZID", OrgDb=org.Hs.eg.db)
## 'select()' returned 1:1 mapping between keys and columns
entrez_ids <- as.list(unique(gene_list$ENTREZID))
# Extract Entrez IDs
#entrez_ids <- gene_list$ENTREZID
kegg_results <- enrichKEGG(gene = entrez_ids,
organism = "hsa", # "hsa" for human, "mmu" for mouse
pvalueCutoff = 0.05) # Adjust p-value threshold as needed
## Reading KEGG annotation online: "https://rest.kegg.jp/link/hsa/pathway"...
## Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/hsa"...
dotplot(kegg_results, showCategory=20, title="KEGG Pathway Enrichment")
emapplot(pairwise_termsim(kegg_results))
barplot(kegg_results, showCategory=10, title="Top KEGG Pathways")
This protocol provides a comprehensive workflow for performing differential gene expression analysis using DESeq2, edgeR, and limma-voom. By identifying common genes across these methods, researchers can increase confidence in the robustness of their findings. The final output includes lists of differentially expressed genes and visualizations to aid in the interpretation of results.
This protocol can be adapted to specific datasets and experimental designs, ensuring flexibility and applicability across various research contexts.