Wilcoxon Rank-Sum Test for Differential Expression

Introduction

This activity demonstrates the Wilcoxon Rank-Sum Test for identifying differential gene expression (DEG). We’ll analyze expression data for two genes: one differentially expressed gene (DEG) and one non-differentially expressed gene (non-DEG). We will also adjust for multiple testing using the False Discovery Rate (FDR).

Provided Dataset

We start with a dataset containing expression levels for two genes across two conditions (Control and Treated), with 5 replicates per condition.

# Provided data
group <- rep(c("Control", "Treated"), each = 5)  # Conditions
gene_deg <- c(5.2, 5.5, 5.1, 5.3, 5.4, 7.2, 7.1, 7.3, 7.4, 7.5)  # DEG
gene_non_deg <- c(5.1, 5.2, 5.3, 5.4, 5.5, 5.1, 5.2, 5.3, 5.4, 5.5)  # Non-DEG

# Create a data frame
expression_data <- data.frame(
  Group = group,
  Gene_DEG = gene_deg,
  Gene_Non_DEG = gene_non_deg
)
head(expression_data)

##     Group Gene_DEG Gene_Non_DEG
## 1 Control      5.2          5.1
## 2 Control      5.5          5.2
## 3 Control      5.1          5.3
## 4 Control      5.3          5.4
## 5 Control      5.4          5.5
## 6 Treated      7.2          5.1

Wilcoxon Rank-Sum Test

We perform the Wilcoxon Rank-Sum Test to compare expression levels between the two groups for each gene.

# Wilcoxon test for DEG
test_deg <- wilcox.test(Gene_DEG ~ Group, data = expression_data)
test_deg

## 
##  Wilcoxon rank sum exact test
## 
## data:  Gene_DEG by Group
## W = 0, p-value = 0.007937
## alternative hypothesis: true location shift is not equal to 0

# Wilcoxon test for Non-DEG
test_non_deg <- wilcox.test(Gene_Non_DEG ~ Group, data = expression_data)

## Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
## compute exact p-value with ties

test_non_deg

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Gene_Non_DEG by Group
## W = 12.5, p-value = 1
## alternative hypothesis: true location shift is not equal to 0

False Discovery Rate (FDR)

When testing multiple genes, we need to adjust for multiple testing to control the False Discovery Rate (FDR). Here, we demonstrate FDR correction for the two genes.

# Collect p-values
p_values <- c(test_deg$p.value, test_non_deg$p.value)
names(p_values) <- c("Gene_DEG", "Gene_Non_DEG")

# Apply FDR correction
adjusted_p_values <- p.adjust(p_values, method = "BH")
adjusted_p_values

##     Gene_DEG Gene_Non_DEG 
##   0.01587302   1.00000000

Visualizing Results with Boxplots (optional)

Boxplot for the DEG

library(ggplot2)

# Boxplot for the DEG
ggplot(expression_data, aes(x = Group, y = Gene_DEG, fill = Group)) +
  geom_boxplot() + 
  labs(
    title = "A Differentially Expressed Gene",
    x = "Condition",
    y = "Expression Level"
  ) +
  theme_classic()

Boxplot for the Non-DEG

# Boxplot for the Non-DEG
ggplot(expression_data, aes(x = Group, y = Gene_Non_DEG, fill = Group)) +
  geom_boxplot() + 
  labs(
    title = "A Non-Differentially Expressed Gene",
    x = "Condition",
    y = "Expression Level"
  ) +
  theme_classic()

Reflection Questions

Why does the DEG show a significant difference in expression levels between the two groups?
Why does the Non-DEG not show a significant difference?
How does the FDR adjustment impact the interpretation of results when testing multiple genes?