Differential Gene Expression Analysis and Identification of Common Genes

Introduction

This protocol outlines the steps for performing differential gene expression (DGE) analysis using three popular R packages: DESeq2, edgeR, and limma-voom. The goal is to identify differentially expressed genes (DEGs) across different conditions and to find common genes that are consistently identified by all three methods. This approach ensures robustness and reliability in the identification of key genes involved in the biological process under study.

Differential Gene Expression (DGE) analysis is a crucial step in transcriptomic studies to identify genes that exhibit significant changes in expression across different conditions. The analysis begins with count data obtained from RNA-Seq experiments, typically processed using tools like FeatureCounts or HTSeq. Several statistical approaches, including DESeq2, edgeR, and limma-voom, are commonly used for DGE analysis. These methods normalize raw counts, account for variability, and apply statistical models to identify significantly differentially expressed genes (DEGs).

In DESeq2, raw count data are first normalized, and lowly expressed genes are filtered out. The method estimates size factors and dispersions before fitting a negative binomial model to assess differential expression using the Wald test. Genes with an adjusted p-value of less than 0.05 and a log2 fold-change greater than 1 are considered significant DEGs. Similarly, edgeR normalizes data using the Trimmed Mean of M-values (TMM) method and applies a generalized linear model (GLM) with an empirical Bayes approach to estimate dispersion and identify DEGs. Limma-voom, on the other hand, transforms count data to log2 scale using the voom method before applying linear models and empirical Bayes statistics to detect differentially expressed genes.

After obtaining DEGs from multiple methods, identifying common genes across these approaches increases result reliability. Intersecting the DEG lists from DESeq2, edgeR, and limma-voom helps in identifying robust candidates for further validation. Additionally, functional enrichment analysis, such as Gene Ontology (GO) and KEGG pathway analysis using clusterProfiler, can provide insights into the biological roles of these genes. By integrating multiple DGE methods and enrichment analysis, researchers can improve the confidence and interpretability of their findings in transcriptomic studies.

Requirements for the Protocol

RNA-Seq Data: GSE284267 count data
Reference Genome & Annotation Files: Necessary for read alignment and gene annotation.
Computational Tools:
- R and Bioconductor packages (for statistical analysis and visualization)
Hardware Requirements:
- High-performance computing system or cloud-based environment for large datasets.
- Sufficient memory (RAM) and storage for processing sequencing data.
Statistical Knowledge:
- Understanding of normalization methods, statistical models, and multiple testing corrections.
Programming Skills:
- Familiarity with R and command-line tools for efficient data processing.

Study data

The dataset GSE284267 examines the role of autophagy in small cell lung cancer (SCLC) heterogeneity and plasticity. Researchers conducted RNA sequencing on the H841 cell line, both untreated and treated with 200 mM trehalose for 12 hours to modulate autophagic flux. The study included three biological replicates for each condition, totaling six samples: three controls and three trehalose-treated samples. This design allowed for a comparative analysis of gene expression changes associated with autophagy modulation in SCLC cells.

Step 1: Data Preparation

# Load necessary libraries
library(clusterProfiler)
library(org.Hs.eg.db)  
library(ggplot2)
library(enrichplot)
library(AnnotationDbi)
library(DESeq2)
library(edgeR)
library(limma) 
library(VennDiagram)
library(ggVennDiagram) 
library(ggrepel)
library(enrichplot)

1.1 Import the Data

# Load the count data 
counts <- read.csv("cancer_countdata.csv", row.names = 1, header = TRUE)

# Replace NA values with zero (recommended for RNA-seq data) 
counts[is.na(counts)] <- 0  

# Display the first few rows of the count data 
head(counts, 5)

##                 H841_Con1 H841_Con2 H841_Con3 H841_Tre_1 H841_Tre_2 H841_Tre_3
## ENSG00000000003      6828      6861      7456       5753       4573       4803
## ENSG00000000005         0        20         0          0          0          0
## ENSG00000000419      3803      3909      4434       4586       3785       3613
## ENSG00000000457       802       604       794       1058        739        859
## ENSG00000000460      3632      3673      3618       1433        907       1001

1.2 Define Experimental Conditions

# Define the conditions (e.g., Control vs. Treated) 
condition <- factor(c(rep("C", 3), rep("T", 3)))  # C = Control, T = Treated 

colData <- data.frame(condition, row.names = colnames(counts))

Step 2: Differential Gene Expression Analysis with DESeq2

2.1 Load DESeq2 and Create DESeqDataSet

# Create DESeqDataSet 
dds <- DESeqDataSetFromMatrix(countData = counts, colData = colData, design = ~ condition)

## converting counts to integer mode

# Pre-filtering: Remove rows with low gene counts 
keep <- rowSums(counts(dds)) >= 10 
dds <- dds[keep,]  

# Set the reference level for the condition factor 
dds$condition <- relevel(dds$condition, ref = "C")  

# Run DESeq2 analysis 
dds <- DESeq(dds)

## estimating size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

2.2 Extract Results

# Extract results for the comparison of interest
result_deseq2 <- results(dds, contrast = c('condition', 'C', 'T'), alpha = 0.05)  

# Convert results to a data frame and add gene symbols 

result_deseq2 <- as.data.frame(result_deseq2) 
result_deseq2$gene <- rownames(result_deseq2)  

# Identify significant DEGs 
DESeq2_DEGup <- result_deseq2[result_deseq2$padj < 0.05 & result_deseq2$log2FoldChange > 1, ] 
DESeq2_DEGdown <- result_deseq2[result_deseq2$padj < 0.05 & result_deseq2$log2FoldChange < -1, ] 
DEGs_deseq2 <- rbind(DESeq2_DEGup, DESeq2_DEGdown)

print("Number of Up-regulated genes")

## [1] "Number of Up-regulated genes"

dim(DESeq2_DEGup)

## [1] 3147    7

Step 3: Differential Gene Expression Analysis with edgeR

3.1 Load edgeR and Create DGEList

# Step 1: Create DGEList object
dge <- DGEList(counts = counts, group = condition)

# Step 2: Filter lowly expressed genes
keep <- rowSums(cpm(dge) > 1) >= 2
dge <- dge[keep, , keep.lib.sizes = FALSE]

# Step 3: Normalize library sizes using TMM normalization
dge <- calcNormFactors(dge)

# Step 4: Create design matrix
design <- model.matrix(~ condition)

# Step 5: Estimate dispersions
dge <- estimateDisp(dge, design)

# Step 6: Fit the model and perform differential expression analysis
fit <- glmFit(dge, design)

# Step 7: Perform likelihood ratio test (LRT)
lrt <- glmLRT(fit, coef = 2)  # Compare Treated (T) vs. Control (C)

3.2 Extract Results

# Extract top differentially expressed genes
result_edgeR <- topTags(lrt, n = Inf)$table

# Add gene names as a column
result_edgeR$gene <- rownames(result_edgeR)

# Identify significant DEGs (e.g., FDR < 0.05 and |logFC| > 1)
edgeR_DEGup <- result_edgeR[result_edgeR$FDR < 0.05 & result_edgeR$logFC > 1, ]
edgeR_DEGdown <- result_edgeR[result_edgeR$FDR < 0.05 & result_edgeR$logFC < -1, ]
DEGs_edgeR <- rbind(edgeR_DEGup, edgeR_DEGdown)

# Save results to a CSV file
write.csv(DEGs_edgeR, "DEGs_edgeR.csv", row.names = FALSE)

Step 4: Differential Gene Expression Analysis with limma-voom

4.1 Load limma and edgeR, and Create DGEList

# Create DGEList object 
dge <- DGEList(counts = counts, group = condition)  

# Filter lowly expressed genes (keep genes with CPM > 1 in at least 2 samples) 
keep <- rowSums(cpm(dge) > 1) >= 2 
dge <- dge[keep, , keep.lib.sizes = FALSE]  

# Normalize library sizes using TMM normalization 
dge <- calcNormFactors(dge) 

# Design matrix for the comparison 
design <- model.matrix(~ condition)  

# Apply Voom transformation 
voom_data <- voom(dge, design, plot = TRUE)

4.2 Fit Linear Model and Extract Results

# Fit a linear model and apply empirical Bayes smoothing 
fit <- lmFit(voom_data, design) 
fit <- eBayes(fit) 

# Extract results 
result_limma <- topTable(fit, coef = 2, number = Inf, adjust = "fdr") 
result_limma$gene <- rownames(result_limma)  

# Identify significant DEGs 
limma_DEGup <- result_limma[result_limma$adj.P.Val < 0.05 & result_limma$logFC > 1, ] 
limma_DEGdown <- result_limma[result_limma$adj.P.Val < 0.05 & result_limma$logFC < -1, ] 
DEGs_limma <- rbind(limma_DEGup, limma_DEGdown)

Step 5: Identify Common Genes Across Methods

5.1 Find Overlapping DEGs

# Extract gene lists from each method 
genes_deseq2 <- rownames(DEGs_deseq2) 
genes_edgeR <- rownames(DEGs_edgeR) 
genes_limma <- rownames(DEGs_limma)  

# Find common genes across all three methods 
common_genes <- intersect(intersect(genes_deseq2, genes_edgeR), genes_limma)  

# Display the number of common genes 
head(common_genes,5)

## [1] "ENSG00000000460" "ENSG00000001631" "ENSG00000002016" "ENSG00000003096"
## [5] "ENSG00000003147"

# Create a list of gene sets
gene_lists <- list(
  DESeq2 = genes_deseq2,
  edgeR = genes_edgeR,
  limma = genes_limma
)

venn_plot <- ggVennDiagram(gene_lists, label_alpha = 0) +
  scale_fill_gradient(low = "lightgreen", high = "purple") +
  theme(legend.position = "none") +
  labs(title = "Venn Diagram of DEGs") +
  theme_minimal() +
  theme(plot.margin = margin(1, 1, 1, 1, "cm")) # Expands plot margins

print(venn_plot)

5.2 Save Common Genes to a File

# Save common genes to a CSV file 
write.csv(common_genes, "common_genes.csv", row.names = FALSE)

Step 6: Visualization of Results

6.1 Volcano Plots for Each Method

# Volcano Plot for DESeq2
volcano_deseq2 <- ggplot(result_deseq2, aes(x = log2FoldChange, y = -log10(padj))) +
  geom_point(aes(color = ifelse(padj < 0.05 & log2FoldChange > 1, "Upregulated", 
                                ifelse(padj < 0.05 & log2FoldChange < -1, "Downregulated", "Not Significant"))), 
             alpha = 0.6) +
  scale_color_manual(values = c("Upregulated" = "red", "Downregulated" = "green", "Not Significant" = "black")) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
  labs(x = "log2 Fold Change", y = "-log10(adjusted p-value)", title = "Volcano Plot - DESeq2",
       color = "Gene Regulation") +
  theme_minimal() +
  geom_text_repel(
    data = rbind(
      subset(result_deseq2, padj < 0.05 & log2FoldChange > 1)[1:5, ],  # Top 5 upregulated genes
      subset(result_deseq2, padj < 0.05 & log2FoldChange < -1)[1:5, ]  # Top 5 downregulated genes
    ),
    aes(label = gene),  # Replace 'gene_name' with the column containing gene names
    size = 3,
    box.padding = 0.5,
    point.padding = 0.3,
    segment.color = 'grey50'
  )

# Print volcano plot
print(volcano_deseq2)

## Warning: Removed 1154 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: ggrepel: 9 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

##############################################

# Volcano Plot for edgeR
volcano_edgeR <- ggplot(result_edgeR, aes(x = logFC, y = -log10(FDR))) +
  geom_point(aes(color = ifelse(FDR < 0.05 & logFC > 1, "Upregulated", 
                                ifelse(FDR < 0.05 & logFC < -1, "Downregulated", "Not Significant"))), 
             alpha = 0.6) +
  scale_color_manual(values = c("Upregulated" = "red", "Downregulated" = "green", "Not Significant" = "black")) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
  labs(x = "log2 Fold Change", y = "-log10(FDR)", title = "Volcano Plot - edgeR",
       color = "Gene Regulation") +
  theme_minimal() +
  geom_text_repel(
    data = rbind(
      subset(result_edgeR, FDR < 0.05 & logFC > 1)[1:5, ],  # Top 5 upregulated genes
      subset(result_edgeR, FDR < 0.05 & logFC < -1)[1:5, ]  # Top 5 downregulated genes
    ),
    aes(label = gene),  # Replace 'gene_name' with the column containing gene names
    size = 3,
    box.padding = 0.5,
    point.padding = 0.3,
    segment.color = 'grey50'
  )

# Print volcano plot
print(volcano_edgeR)

########################################

# Volcano Plot for limma-voom
volcano_limma <- ggplot(result_limma, aes(x = logFC, y = -log10(adj.P.Val))) +
  geom_point(aes(color = ifelse(adj.P.Val < 0.05 & logFC > 1, "Upregulated", 
                                ifelse(adj.P.Val < 0.05 & logFC < -1, "Downregulated", "Not Significant"))), 
             alpha = 0.6) +
  scale_color_manual(values = c("Upregulated" = "red", "Downregulated" = "green", "Not Significant" = "black")) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
  labs(x = "log2 Fold Change", y = "-log10(adjusted p-value)", title = "Volcano Plot - limma-voom",
       color = "Gene Regulation") +
  theme_minimal() +
  geom_text_repel(
    data = rbind(
      subset(result_limma, adj.P.Val < 0.05 & logFC > 1)[1:5, ],  # Top 5 upregulated genes
      subset(result_limma, adj.P.Val < 0.05 & logFC < -1)[1:5, ]  # Top 5 downregulated genes
    ),
    aes(label = gene),  # Replace 'gene_name' with the column containing gene names
    size = 3,
    box.padding = 0.5,
    point.padding = 0.3,
    segment.color = 'grey50'
  )

# Print volcano plot
print(volcano_limma)

## Warning: ggrepel: 9 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Step 7: Downstream Analysis

7.1. Perform Gene Ontology (GO) Enrichment Analysis with Gene Symbols

# Install if not already installed
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("clusterProfiler", "org.Hs.eg.db", "AnnotationDbi"))

# Convert Ensembl IDs to Gene Symbols
gene_symbols <- mapIds(org.Hs.eg.db, 
                       keys = common_genes, 
                       column = "SYMBOL", 
                       keytype = "ENSEMBL", 
                       multiVals = "first")

## 'select()' returned 1:many mapping between keys and columns

# Remove NA values (unmatched Ensembl IDs)
gene_symbols <- na.omit(gene_symbols)

# GO enrichment for Biological Process
go_bp <- enrichGO(gene = gene_symbols,
                  OrgDb = org.Hs.eg.db,
                  keyType = "SYMBOL",
                  ont = "BP",  # Options: "BP", "MF", "CC"
                  pAdjustMethod = "fdr",
                  pvalueCutoff = 0.05,
                  readable = TRUE)

# GO enrichment for Molecular Function
go_mf <- enrichGO(gene = gene_symbols,
                  OrgDb = org.Hs.eg.db,
                  keyType = "SYMBOL",
                  ont = "MF",
                  pAdjustMethod = "fdr",
                  pvalueCutoff = 0.05,
                  readable = TRUE)

# GO enrichment for Cellular Component
go_cc <- enrichGO(gene = gene_symbols,
                  OrgDb = org.Hs.eg.db,
                  keyType = "SYMBOL",
                  ont = "CC",
                  pAdjustMethod = "fdr",
                  pvalueCutoff = 0.05,
                  readable = TRUE)

7.2.Visualize GO Enrichment Results

# Dot plot for BP terms
dotplot(go_bp, showCategory = 10, title = "GO Biological Process (BP)")

# Dot plot for MF terms
dotplot(go_mf, showCategory = 10, title = "GO Molecular Function (MF)")

# Dot plot for MF terms
dotplot(go_cc, showCategory = 10, title = "GO Molecular Function (CC)")

# Plot Biological Process enrichment
barplot(go_bp, showCategory=10, title="Top 10 GO Biological Processes")

# Plot Molecular Function enrichment
barplot(go_mf, showCategory=10, title="Top 10 GO Molecular Functions")

# Plot Cellular Component enrichment
barplot(go_cc, showCategory=10, title="Top 10 GO Cellular Components")

# Compute pairwise term similarity
go_bp_sim <- pairwise_termsim(go_bp)

# Create the enrichment map plot with some customizations
emapplot(go_bp_sim, 
         label_fontsize = 12,         # Adjust the label font size
         node_label_size = 5,         # Node label size (you can adjust as needed)
         colorEdge = TRUE,            # Optional: Color the edges based on similarity
         title = "GO Enrichment Map") + 
  scale_color_gradient(low = "blue", high = "red")  # Customize color gradient

7.3.KEGG pathway analysis

# Run KEGG enrichment analysis
gene_symbols = as.data.frame(gene_symbols)

# Convert gene symbols to Entrez IDs
gene_list <- bitr(gene_symbols$gene_symbols, fromType="SYMBOL", toType="ENTREZID", OrgDb=org.Hs.eg.db)

## 'select()' returned 1:1 mapping between keys and columns

entrez_ids <- as.list(unique(gene_list$ENTREZID))

# Extract Entrez IDs
#entrez_ids <- gene_list$ENTREZID
kegg_results <- enrichKEGG(gene = entrez_ids,
                           organism = "hsa",   # "hsa" for human, "mmu" for mouse
                           pvalueCutoff = 0.05)  # Adjust p-value threshold as needed

## Reading KEGG annotation online: "https://rest.kegg.jp/link/hsa/pathway"...

## Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/hsa"...

dotplot(kegg_results, showCategory=20, title="KEGG Pathway Enrichment")

emapplot(pairwise_termsim(kegg_results))

barplot(kegg_results, showCategory=10, title="Top KEGG Pathways")

Conclusion

This protocol provides a comprehensive workflow for performing differential gene expression analysis using DESeq2, edgeR, and limma-voom. By identifying common genes across these methods, researchers can increase confidence in the robustness of their findings. The final output includes lists of differentially expressed genes and visualizations to aid in the interpretation of results.

This protocol can be adapted to specific datasets and experimental designs, ensuring flexibility and applicability across various research contexts.