This analysis performs differential gene expression analysis on RNA-seq data from maize leaf samples across multiple experimental factors:
The analysis uses the limma-voom pipeline to identify genes that respond to:
This model describes the log-transformed expression \(y_{g,i}\) of gene \(g\) in sample \(i\) as a function of spatial,
developmental, and treatment effects.
Each gene is fitted separately using a common design matrix, producing
gene-specific regression coefficients.
\[ \begin{aligned} y_{g,i} &= \beta_{g,0} + \beta_{g,R}\,\text{Row}_i + \beta_{g,L}\,\text{leafc}_i + \beta_{g,T}\,T_i \\ &\quad + \beta_{g,L\times T}\,(\text{leafc}_i \times T_i) + \beta_{g,G}\,\text{G}_i + \beta_{g,G\times T}\,(\text{G}_i \times T_i) \\ &\quad + \beta_{g,G\times L}\,(\text{G}_i \times \text{leafc}_i) + u_{g,\text{col}(i):T(i)} + \varepsilon_{g,i} \end{aligned} \]
with,
\[ u_{g,\text{col}(i):T(i)} \sim N(0, \sigma^2_{u,g}), \quad \varepsilon_{g,i} \sim N(0, \phi_i\sigma^2_{g}) \]
| Symbol | Description | Varies Over | Notes |
|---|---|---|---|
| \(y_{g,i}\) | log\(_2\)(CPM) expression of gene \(g\) in sample \(i\) | genes, samples | Response variable |
| \(\beta_{g,0}\) | Intercept (mean expression at centered leaf stage, +P treatment) | genes | Baseline expression |
| \(\beta_{g,R}\) | Fixed effect of plot row | genes | Corrects for field spatial gradients |
| \(\beta_{g,L}\) | Slope for leaf developmental stage (centered) | genes | Represents linear change with leaf age |
| \(\beta_{g,T}\) | Effect of phosphorus treatment (−P vs. +P) | genes | Treatment main effect |
| \(\beta_{g,L\times T}\) | Interaction between leaf stage and treatment | genes | Tests whether developmental slope differs under P deficiency |
| \(\beta_{g,G}\) | Effect of inversion genotype | genes | Genotypic main effect |
| \(\beta_{g,G\times T}\) | Interaction between inversion and treatment | genes | Tests whether inversion modifies P response |
| \(\beta_{g,G\times L}\) | Interaction between inversion and leaf | genes | Tests whether the effet of the leaf depends on Inv4m |
| \(u_{g,\text{col}(i):T(i)}\) | Random effect of plot column nested within treatment | samples (within treatment) | Accounts for spatial autocorrelation unique to each treatment field |
| \(\varepsilon_{g,i}\) | Residual error with voom-derived precision weights | samples | Captures heteroskedastic measurement error |
Row,
leaf_c, T, and G vary at the
sample level.voom transformation, where samples with
lower measurement variance receive higher weight.This parametrization improves interpretability by centering
leaf_c (so the intercept represents average expression) and
nesting the spatial correction by treatment rather than imposing a
global linear spatial trend.
Genes are classified as:
Three levels of stringency for different analyses:
Significant DEGs: for Manhattan plots, FDR <
0.05 (is_DEG).
High-confidence DEGs: for GO enrichment, FDR
< 0.05 AND \(log_2 FC \pm 1.5\)
(is_hiconf_DEG).
Selected DEGs: for manuscript main tables, the
union (\(\cup\)) of the top 20 by
significance and the top 20 by Mahalanobis distance
(is_selected_DEG).
library(edgeR) # Differential expression analysis
library(limma) # Linear models for RNA-seq
library(rtracklayer) # Genomic annotation handling
library(GenomicRanges) # Genomic ranges operations
library(dplyr) # Data manipulation
library(tidyr) # Data manipulation
library(stringr) # String manipulation
library(ggplot2) # Plotting
library(ggpubr) # Publication ready plots
library(ggfx) # Extra effects for ggplots (shadows)
library(ggtext) # Formatted text in plots
library(robustbase) # Robust statistics (MCD for Mahalanobis)
counts <- read.csv("../data/inv4mRNAseq_gene_sample_exp.csv")
{
cat("Loaded expression data:\n")
cat(" Dimensions:", dim(counts), "\n")
cat(" Genes:", nrow(counts), "\n")
cat(" Samples:", ncol(counts) - 2, "\n")
}
## Loaded expression data:
## Dimensions: 39756 62
## Genes: 39756
## Samples: 60
sampleInfo <- read.csv("../data/PSU-PHO22_Metadata.csv")
sampleInfo$Treatment <- factor(sampleInfo$Treatment)
levels(sampleInfo$Treatment) <- c("+P", "-P")
sampleInfo$Genotype <- factor(sampleInfo$Genotype)
levels(sampleInfo$Genotype) <- c("CTRL", "Inv4m")
{
cat("\nSample metadata:\n")
cat(" Total samples:", nrow(sampleInfo), "\n")
cat(" Genotypes:", paste(unique(sampleInfo$Genotype), collapse = ", "), "\n")
cat(" Treatments:", paste(unique(sampleInfo$Treatment), collapse = ", "), "\n")
}
##
## Sample metadata:
## Total samples: 64
## Genotypes: CTRL, Inv4m
## Treatments: -P, +P
# Create sample ID mapping
tag <- sampleInfo$side_tag
names(tag) <- sampleInfo$library
# Extract gene IDs
gene_ids <- data.frame(gene = counts[, 2])
# Convert to matrix and set row names
counts <- as.matrix(counts[, -c(1:2)])
rownames(counts) <- gene_ids$gene
# Map sample names using tags
sampleNames <- tag[colnames(counts)]
colnames(counts) <- sampleNames
# Reorder metadata to match count matrix column order
sampleInfo <- sampleInfo[match(sampleNames, sampleInfo$side_tag), ]
{
cat("\nAll samples in metadata:",
all(sampleNames %in% sampleInfo$side_tag), "\n")
cat("Count matrix prepared:\n")
cat(" Genes:", nrow(counts), "\n")
cat(" Samples:", ncol(counts), "\n")
}
##
## All samples in metadata: TRUE
## Count matrix prepared:
## Genes: 39756
## Samples: 60
# Create DGEList with counts and sample information
y <- DGEList(counts = counts, samples = sampleInfo)
# Define groups from Treatment and Genotype interaction
y$group <- interaction(y$samples$Treatment, y$samples$Genotype)
cat("\nDGEList object created\n")
##
## DGEList object created
head(y$samples)
Using filterByExpr to remove genes with low counts
across samples.
# Keep genes with sufficient expression
keep <- filterByExpr(y, group = y$group)
y_filtered <- y[keep, ]
{
cat("\nExpression filtering:\n")
cat(" Genes before:", nrow(y), "\n")
cat(" Genes after:", nrow(y_filtered), "\n")
cat(" Genes removed:", sum(!keep), "\n")
}
##
## Expression filtering:
## Genes before: 39756
## Genes after: 24249
## Genes removed: 15507
Compute MDS on all libraries (after gene filtering but before sample filtering) to identify low-quality samples based on library size.
# MDS with all samples (before library size filtering)
mds_all <- plotMDS(y_filtered, pch = 21, plot = TRUE)
# Histogram of library sizes
hist(
y$samples$lib.size / 1e6,
main = "Library Size Distribution",
xlab = "Library Size (millions of reads)",
breaks = 20
)
{
cat("\nLibrary size summary:\n")
print(summary(y$samples$lib.size))
cat("\nSamples with lib.size < 20 million:",
sum(y$samples$lib.size < 2e7), "\n")
}
##
## Library size summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 327979 10716231 31446964 27205445 36129276 56805083
##
## Samples with lib.size < 20 million: 17
# Prepare data for plotting
mds_qc_data <- y_filtered$samples %>%
mutate(
dim1 = mds_all$x,
dim2 = mds_all$y,
lib_size_millions = lib.size / 1e6
)
# Plot MDS colored by library size
ggplot(mds_qc_data, aes(x = dim1, y = dim2, color = lib_size_millions)) +
geom_point(size = 3) +
scale_color_viridis_c(option = "plasma") +
labs(
x = paste0("Dim1 (", round(100 * mds_all$var.explained[1]), "%)"),
y = paste0("Dim2 (", round(100 * mds_all$var.explained[2]), "%)"),
color = "Library Size\n(millions)"
) +
theme_classic2(base_size = 15)
Samples with library size < 20 million reads are considered low quality.
# Flag low count libraries
y_filtered$samples$lowCount <- y_filtered$samples$lib.size < 2e7
# Remove low quality samples
y_filtered_bySample <- y_filtered[, !y_filtered$samples$lowCount]
{
cat("\nLow quality libraries:\n")
table(y_filtered$samples$lowCount)
cat("\nSamples after filtering:\n")
cat(" Retained:", ncol(y_filtered_bySample), "\n")
cat(" Removed:", sum(y_filtered$samples$lowCount), "\n")
}
##
## Low quality libraries:
##
## Samples after filtering:
## Retained: 43
## Removed: 17
Verify experimental design balance across factors after quality filtering.
{
with(y_filtered_bySample$samples, {
cat("\n=== Sample Distribution by Factors ===\n")
cat("\n-- Treatment --\n")
table(Treatment)
cat("\n-- Genotype --\n")
table(Genotype)
cat("\n-- Leaf Tissue --\n")
table(leaf_tissue)
cat("\n-- Treatment × Leaf Tissue --\n")
table(Treatment, leaf_tissue)
cat("\n-- Genotype × Leaf Tissue --\n")
table(Genotype, leaf_tissue)
cat("\n-- Treatment × Genotype × Leaf Tissue (3-way) --\n")
table(Treatment, Genotype, leaf_tissue)
})
}
##
## === Sample Distribution by Factors ===
##
## -- Treatment --
##
## -- Genotype --
##
## -- Leaf Tissue --
##
## -- Treatment × Leaf Tissue --
##
## -- Genotype × Leaf Tissue --
##
## -- Treatment × Genotype × Leaf Tissue (3-way) --
## , , leaf_tissue = 1
##
## Genotype
## Treatment CTRL Inv4m
## +P 3 3
## -P 1 4
##
## , , leaf_tissue = 2
##
## Genotype
## Treatment CTRL Inv4m
## +P 3 3
## -P 3 2
##
## , , leaf_tissue = 3
##
## Genotype
## Treatment CTRL Inv4m
## +P 3 3
## -P 2 3
##
## , , leaf_tissue = 4
##
## Genotype
## Treatment CTRL Inv4m
## +P 3 3
## -P 2 2
MDS reveals sources of variation in gene expression across samples. Dimensions 3 and 4 are extracted from eigenvectors scaled by square root of variance explained.
mds <- plotMDS(
y_filtered_bySample,
pch = 21,
label = y_filtered_bySample$samples$side_tag,
plot = FALSE
)
# Store MDS coordinates in sample data
d <- y_filtered_bySample$samples
d$dim1 <- mds$x
d$dim2 <- mds$y
d$dim3 <- mds$eigen.vectors[, 3] * sqrt(mds$var.explained[3])
d$dim4 <- mds$eigen.vectors[, 4] * sqrt(mds$var.explained[4])
# Prepare factors for plotting
d$Treatment <- factor(d$Treatment)
d$Genotype <- factor(d$Genotype)
d$RNA_Plant <- factor(d$RNA_Plant)
# Variance explained by each dimension
tibble(
dimension = paste("Dim", 1:4),
var_explained = round(mds$var.explained[1:4], 4)
)
Exploring which experimental factors drive the main dimensions of variation.
# Treatment
p1 <- ggplot(d, aes(x = dim1, y = dim2, color = Treatment)) +
geom_point(size = 3) +
theme_classic2(base_size = 15) +
ggtitle("Treatment")
# Row position
p2 <- ggplot(d, aes(x = dim1, y = dim2, color = row, shape = Treatment)) +
geom_point(size = 3) +
theme_classic2(base_size = 15) +
ggtitle("Row Position")
# Collection time
p3 <- ggplot(d, aes(x = dim1, y = dim2, color = decimal_time)) +
geom_point(size = 3) +
theme_classic2(base_size = 15) +
ggtitle("Collection Time")
# Collector
p4 <- ggplot(d, aes(x = dim1, y = dim2, color = COLLECTOR)) +
geom_point(size = 3) +
theme_classic2(base_size = 15) +
ggtitle("Collector")
# Genotype
p5 <- ggplot(d, aes(x = dim1, y = dim2, color = Genotype)) +
geom_point(size = 3) +
theme_classic2(base_size = 15) +
ggtitle("Genotype")
# Leaf tissue
p6 <- d %>%
mutate(leaf = factor(leaf_tissue)) %>%
ggplot(aes(x = dim1, y = dim2, color = leaf)) +
geom_point(size = 3) +
theme_classic2(base_size = 15) +
ggtitle("Leaf Tissue")
ggarrange(p1, p2, p3, p4, p5, p6, ncol = 3, nrow = 2)
d %>%
mutate(leaf = factor(leaf_tissue)) %>%
ggplot(aes(x = dim1, y = dim2)) +
xlab(paste0("dim1 (", round(100 * mds$var.explained[1]), "%)")) +
ylab(paste0("dim2 (", round(100 * mds$var.explained[2]), "%)")) +
geom_point(aes(fill = leaf, shape = Treatment), size = 4) +
scale_fill_viridis_d() +
scale_shape_manual(values = c(24, 21)) +
guides(
shape = guide_legend(
title = "Treatment",
order = 1,
override.aes = list(size = 7)
),
fill = guide_legend(
title = "Leaf",
order = 2,
override.aes = list(geom = "point", shape = 22, size = 7)
)
) +
theme_classic2(base_size = 25) +
theme(
legend.box = "horizontal",
legend.spacing = unit(0, "line"),
legend.box.spacing = unit(0, "in"),
legend.position = c(0.75, 0.17)
)
The third dimension separates by genotype.
# Set up genotype labels with italic formatting
labels <- c("CTRL", "*Inv4m*")
names(labels) <- c("CTRL", "INV4")
d %>%
ggplot(aes(x = dim3, y = dim4, fill = Genotype, shape = Treatment)) +
xlab(paste0("dim3 (", round(100 * mds$var.explained[3]), "%)")) +
ylab(paste0("dim4 (", round(100 * mds$var.explained[4]), "%)")) +
geom_point(size = 4) +
scale_fill_viridis_d(direction = -1, labels = labels) +
scale_shape_manual(values = c(24, 21)) +
guides(
shape = "none",
fill = guide_legend(
title = "Genotype",
order = 2,
override.aes = list(geom = "point", shape = 22, size = 7, reverse = TRUE)
)
) +
theme_classic2(base_size = 25) +
theme(
legend.position = c(0.89, 0.9),
legend.text = element_markdown(),
legend.spacing = unit(0, "line"),
legend.box.spacing = unit(0, "line")
)
# Calculate correlations and p-values
mds_cor_results <- tibble(
dimension = c("Dim1", "Dim2", "Dim3"),
factor = c("Treatment", "Leaf tissue", "Genotype"),
correlation = c(
cor(d$dim1, as.numeric(d$Treatment)),
cor(d$dim2, d$leaf_tissue),
cor(d$dim3, as.numeric(d$Genotype))
),
p_value = c(
cor.test(d$dim1, as.numeric(d$Treatment))$p.value,
cor.test(d$dim2, d$leaf_tissue)$p.value,
cor.test(d$dim3, as.numeric(d$Genotype))$p.value
)
) %>%
mutate(
adj_p_value = p.adjust(p_value, method = "fdr"),
correlation = round(correlation, 3),
p_value = signif(p_value, 3),
adj_p_value = signif(adj_p_value, 3)
)
mds_cor_results
Apply TMM (Trimmed Mean of M-values) normalization to account for sequencing depth differences, then fit a linear model including spatial covariates (Plot_Row, Plot_Column), biological factors (Genotype, leaf_tissue, Treatment), and the leaf_tissue:Treatment interaction. Voom transformation estimates mean-variance relationships and computes precision weights for each observation.
# TMM normalization
# --- Center the continuous leaf stage variable ---
# Convert leaf_tissue (leaf age or developmental stage) to a centered numeric covariate.
# Centering subtracts the mean leaf age so that:
# - the intercept represents expression at the *average leaf age*,
# - the Treatment coefficient corresponds to the treatment effect at mean leaf stage,
# - and collinearity between leaf age and Treatment is minimized.
d$leaf_tissue_c <- scale(d$leaf_tissue, center = TRUE, scale = FALSE)
# Design matrix: spatial covariates + biological factors + interaction
# Column correction:
# Define block: column nested within treatment
block <- interaction(d$Treatment, d$Plot_Column)
design <- model.matrix(
~ Plot_Row + leaf_tissue_c*Treatment + Genotype*Treatment + Genotype* leaf_tissue_c,
# ~ Plot_Column + Plot_Row + leaf_tissue_c + Treatment + leaf_tissue_c:Treatment + Genotype,
d
)
# Voom transformation with precision weights
voomR <- voom(y_filtered_bySample, design = design, plot = FALSE)
# Save normalized expression for downstream analyses
saveRDS(voomR$E, file = "~/Desktop/normalized_expression_logCPM_leaf_trt.rda")
saveRDS(voomR, file = "~/Desktop/normalized_expression_voom_object_leaf_trt.rda")
{
cat("Normalization factors range:",
range(y_filtered_bySample$samples$norm.factors), "\n\n\n")
cat("Design matrix:", nrow(design), "samples ×", ncol(design),
"coefficients\n")
design %>% as.data.frame %>% tibble() %>% print()
cat("Voom expression matrix:", nrow(voomR$E), "genes ×",
ncol(voomR$E), "samples\n")
}
## Normalization factors range: 1 1
##
##
## Design matrix: 43 samples × 8 coefficients
## # A tibble: 43 × 8
## `(Intercept)` Plot_Row leaf_tissue_c `Treatment-P` GenotypeInv4m
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 4 -1.47 1 1
## 2 1 4 -0.465 1 1
## 3 1 4 0.535 1 1
## 4 1 4 1.53 1 1
## 5 1 6 -1.47 1 1
## 6 1 6 0.535 1 1
## 7 1 5 -1.47 1 0
## 8 1 5 -0.465 1 0
## 9 1 5 0.535 1 0
## 10 1 5 1.53 1 0
## # ℹ 33 more rows
## # ℹ 3 more variables: `leaf_tissue_c:Treatment-P` <dbl>,
## # `Treatment-P:GenotypeInv4m` <dbl>, `leaf_tissue_c:GenotypeInv4m` <dbl>
## Voom expression matrix: 24249 genes × 43 samples
Fit linear model to voom-transformed data, then apply robust empirical Bayes moderation to stabilize variance estimates and improve power for differential expression testing.
# --- 4. Estimate consensus correlation for blocks ---
corfit <- duplicateCorrelation(voomR, design, block = block)
# --- 5. Fit linear model accounting for block correlation ---
fit <- lmFit(voomR, design, block = block, correlation = corfit$consensus.correlation)
# --- 6. Apply empirical Bayes moderation ---
ebfit <- eBayes(fit)
{
cat("Model fitted:", nrow(fit$coefficients), "genes ×",
ncol(fit$coefficients), "coefficients\n")
cat("\nSignificant genes per coefficient (FDR < 0.05):\n")
print(colSums(abs(decideTests(ebfit))))
}
## Model fitted: 24249 genes × 8 coefficients
##
## Significant genes per coefficient (FDR < 0.05):
## (Intercept) Plot_Row
## 19367 0
## leaf_tissue_c Treatment-P
## 7888 6171
## GenotypeInv4m leaf_tissue_c:Treatment-P
## 778 2570
## Treatment-P:GenotypeInv4m leaf_tissue_c:GenotypeInv4m
## 2 1
We focus on biological effects while accounting for spatial covariates.
# Define predictors of interest with ORIGINAL names from model
predictors_original <- c(
"leaf_tissue_c",
"Treatment-P",
"GenotypeINV4",
"Treatment-P:GenotypeInv4m",
"leaf_tissue_c:Treatment-P",
"leaf_tissue_c:GenotypeInv4m"
)
topTable(ebfit)%>% colnames()
## [1] "Plot_Row" "leaf_tissue_c"
## [3] "Treatment.P" "GenotypeInv4m"
## [5] "leaf_tissue_c.Treatment.P" "Treatment.P.GenotypeInv4m"
## [7] "leaf_tissue_c.GenotypeInv4m" "AveExpr"
## [9] "F" "P.Value"
## [11] "adj.P.Val"
predictors_toptable <- c(
"leaf_tissue_c",
"Treatment.P",
"GenotypeINV4",
"Treatment.P.GenotypeInv4m",
"leaf_tissue_c.Treatment.P",
"leaf_tissue_c.GenotypeInv4m"
)
# Define STANDARDIZED names for output
predictors_standard <- c(
"Leaf",
"-P",
"Inv4m",
"Leaf:-P",
"Leaf:-P",
"Leaf:Inv4m"
)
# Create mapping
predictor_map <- setNames(predictors_standard, predictors_original)
{
cat("\nExtracting coefficients:\n")
for (i in seq_along(predictors_original)) {
cat(" ", predictors_original[i], "→", predictors_standard[i], "\n")
}
}
##
## Extracting coefficients:
## leaf_tissue_c → Leaf
## Treatment-P → -P
## GenotypeINV4 → Inv4m
## Treatment-P:GenotypeInv4m → Leaf:-P
## leaf_tissue_c:Treatment-P → Leaf:-P
## leaf_tissue_c:GenotypeInv4m → Leaf:Inv4m
For each predictor, extract results and calculate 95% confidence intervals.
#' Extract differential expression results for specified predictors
#'
#' @param ebfit An eBayes fitted model object from limma
#' @param predictor_map Named vector mapping coefficient names to display names
#'
#' @return A data frame with DE results for all specified predictors
extract_predictor_effects <- function(ebfit, predictor_map) {
# Get coefficient names from the model
coef_names <- colnames(coef(ebfit))
# Match predictor names to coefficient positions
coef_indices <- match(names(predictor_map), coef_names)
# Validate all predictors found
if (any(is.na(coef_indices))) {
missing <- names(predictor_map)[is.na(coef_indices)]
stop("Coefficients not found: ", paste(missing, collapse = ", "),
"\nAvailable: ", paste(coef_names, collapse = ", "))
}
# Extract results for each predictor
result_list <- lapply(seq_along(coef_indices), function(i) {
idx <- coef_indices[i]
tt <- topTable(ebfit, coef = idx, sort.by = "none", n = Inf)
# Calculate 95% confidence intervals
crit_value <- qt(0.975, ebfit$df.residual + ebfit$df.prior)
std_errors <- ebfit$stdev.unscaled[, idx] * sqrt(ebfit$s2.post)
tt$std_err <- std_errors
tt$upper <- tt$logFC + crit_value * std_errors
tt$lower <- tt$logFC - crit_value * std_errors
tt$predictor <- predictor_map[i]
tt$response <- rownames(tt)
tt
})
# Combine and format
effects <- do.call(rbind, result_list)
rownames(effects) <- NULL
effects %>%
select(predictor, response, everything()) %>%
arrange(adj.P.Val)
}
# Map coefficients (use names from coef(), not topTable())
predictor_map <- c(
"leaf_tissue_c" = "Leaf",
"Treatment-P" = "-P",
"GenotypeInv4m" = "Inv4m",
"Treatment-P:GenotypeInv4m" = "Inv4m:-P",
"leaf_tissue_c:Treatment-P" = "Leaf:-P",
"leaf_tissue_c:GenotypeInv4m" = "Leaf:Inv4m"
)
# Verify coefficient names match
{
cat("Available coefficients:\n")
print(colnames(coef(ebfit)))
}
## Available coefficients:
## [1] "(Intercept)" "Plot_Row"
## [3] "leaf_tissue_c" "Treatment-P"
## [5] "GenotypeInv4m" "leaf_tissue_c:Treatment-P"
## [7] "Treatment-P:GenotypeInv4m" "leaf_tissue_c:GenotypeInv4m"
results <-
# Define factor level order for predictors
effect_order <- c("Leaf", "-P", "Leaf:-P","Inv4m","Inv4m:-P","Leaf:Inv4m")
effects_df <- extract_predictor_effects(
ebfit,
predictor_map
) %>%
dplyr::rename(gene ="response") %>%
mutate(predictor = factor(predictor, levels = effect_order)) %>%
# Add negative log10 p-value for visualization
mutate(neglogP = -log10(adj.P.Val))
# Summary
{
cat("\nTotal tests:", nrow(effects), "\n")
cat("Tests per predictor:\n")
print(table(effects_df$predictor))
cat("\nSignificant per predictor (FDR < 0.05):\n")
print(table(effects_df$predictor[effects_df$adj.P.Val < 0.05]))
}
##
## Total tests:
## Tests per predictor:
##
## Leaf -P Leaf:-P Inv4m Inv4m:-P Leaf:Inv4m
## 24249 24249 24249 24249 24249 24249
##
## Significant per predictor (FDR < 0.05):
##
## Leaf -P Leaf:-P Inv4m Inv4m:-P Leaf:Inv4m
## 7888 6171 2570 778 2 1
Combine all predictor results and annotate with gene information.
effects_df <- effects_df %>%
mutate(is_DEG = adj.P.Val < 0.05) %>%
mutate(regulation = case_when(
is_DEG & logFC > 0 ~ "Upregulated",
is_DEG & logFC < 0 ~ "Downregulated",
.default = "Unregulated"
))
{
cat("\nCombined effects table created:\n")
with(effects_df, {
table(predictor, is_DEG)
table(predictor, regulation)
})
}
##
## Combined effects table created:
## regulation
## predictor Downregulated Unregulated Upregulated
## Leaf 3032 16361 4856
## -P 3961 18078 2210
## Leaf:-P 1161 21679 1409
## Inv4m 540 23471 238
## Inv4m:-P 2 24247 0
## Leaf:Inv4m 1 24248 0
Identifies extreme differentially expressed genes using robust Mahalanobis distance based on the Minimum Covariance Determinant (MCD) method. This approach is resistant to the influence of outliers themselves, providing more reliable outlier detection than classical methods.
The MCD method estimates location and covariance using only the most central 75% of observations (alpha = 0.75), making it robust to contamination.
# Helper function: Calculate robust Mahalanobis distance for one predictor
calculate_robust_distance <- function(per_predictor, mcd_alpha) {
# Extract bivariate data (logFC and -log10(FDR))
bivariate <- per_predictor %>%
select(logFC, neglogP) %>%
as.matrix()
# Compute robust location and covariance using MCD
mcd_result <- covMcd(bivariate, alpha = mcd_alpha)
# Calculate robust Mahalanobis distances
per_predictor$mahalanobis <- mahalanobis(
x = bivariate,
center = mcd_result$center,
cov = mcd_result$cov
)
per_predictor
}
# Main function: Add robust Mahalanobis outlier flags
add_mahalanobis_outliers <- function(
data = NULL,
distance_quantile = 0.05,
FDR = 0.05,
mcd_alpha = 0.75
) {
# Calculate robust Mahalanobis distance per predictor
data <- split(data, factor(data$predictor)) %>%
lapply(calculate_robust_distance, mcd_alpha = mcd_alpha) %>%
bind_rows()
# Chi-square cutoff for bivariate data (df = 2)
cutoff <- qchisq(p = 1 - distance_quantile, df = 2)
# Flag outliers: significant AND extreme distance
data$is_mh_outlier <- (data$adj.P.Val < FDR) & (data$mahalanobis > cutoff)
# Sort by distance within groups
data %>%
ungroup() %>%
group_by(predictor, regulation) %>%
arrange(-mahalanobis, .by_group = TRUE) %>%
ungroup()
}
effects_df <- add_mahalanobis_outliers(
effects_df,
distance_quantile = 0.05,
FDR = 0.05
)
{
cat("\nMahalanobis outliers detected:\n")
with(effects_df, {
table(predictor, is_mh_outlier)
})
}
##
## Mahalanobis outliers detected:
## is_mh_outlier
## predictor FALSE TRUE
## Leaf 19460 4789
## -P 19650 4599
## Leaf:-P 21679 2570
## Inv4m 23471 778
## Inv4m:-P 24247 2
## Leaf:Inv4m 24248 1
Load gene symbols, functional descriptions (Pannzer), and genomic coordinates (B73 v5 GFF3). Gene IDs are cleaned and coordinates imported as both GRanges (for overlap operations) and data.frame (for dplyr filtering).
# Gene symbols and locus names
gene_symbol <- read.table(
"/Users/fvrodriguez/Library/CloudStorage/GoogleDrive-frodrig4@ncsu.edu/My Drive/repos/inv4mRNA/data/gene_symbol.tab",
quote = "",
header = TRUE,
sep = "\t",
na.strings = ""
)
# Pannzer functional annotations
pannzer <- read.table(
"../data/PANNZER_DESC.tab",
quote = "",
header = TRUE,
sep = "\t"
) %>%
group_by(gene_model) %>%
dplyr::slice(1) %>%
select(gene_model, desc)
# Replace your gene_pannzer creation
gene_symbol_unique <- gene_symbol %>%
group_by(gene_model, locus_symbol) %>%
dplyr::slice(1) %>%
ungroup()
# Merge annotations
gene_pannzer <- gene_symbol_unique %>%
left_join(pannzer, by = c("gene_model" = "gene_model")) %>%
group_by(gene_model) %>%
dplyr::slice(1) %>%
ungroup() %>%
select(gene_model, locus_symbol, locus_name, desc)
# Genomic coordinates (GRanges + data.frame)
v5_gff_file <- "/Users/fvrodriguez/ref/zea/Zea_mays.Zm-B73-REFERENCE-NAM-5.0.59.chr.gff3"
genes_gr <- rtracklayer::import(v5_gff_file) %>%
subset(type == "gene" & seqnames %in% 1:10)
genes <- as.data.frame(genes_gr)
genes$ID <- gsub("gene:", "", genes$ID)
{
cat("Annotations loaded:\n")
cat(" Gene symbols:", nrow(gene_symbol), "\n")
cat(" Pannzer descriptions:", nrow(pannzer), "\n")
cat(" Merged annotations:", nrow(gene_pannzer), "\n")
cat(" Genomic features:", nrow(genes), "genes across",
length(unique(genes$seqnames)), "chromosomes\n")
}
## Annotations loaded:
## Gene symbols: 44364
## Pannzer descriptions: 28308
## Merged annotations: 44303
## Genomic features: 43459 genes across 10 chromosomes
Define three nested regions on chromosome 4: (1) Inv4m inversion proper (gene-defined boundaries), (2) shared introgression including flanking regions (manually verified from genotyping data), and (3) flanking regions (in introgression but outside inversion).
# Inv4m inversion boundaries (defined by specific genes)
inv4m_start <- genes[genes$ID == "Zm00001eb190470", "start"]
inv4m_end <- genes[genes$ID == "Zm00001eb194800", "end"]
# Shared introgression boundaries (from RNAseq genotype verification)
introgression_start <- 157012149
introgression_end <- 195900523
# Extract gene IDs for each region
inv4m_gene_ids <- genes %>%
filter(seqnames == 4, start >= inv4m_start, end <= inv4m_end) %>%
pull(ID)
shared_introgression_gene_ids <- genes %>%
filter(seqnames == 4, start >= introgression_start, end <= introgression_end) %>%
pull(ID)
flanking_introgression_gene_ids <- shared_introgression_gene_ids[
!(shared_introgression_gene_ids %in% inv4m_gene_ids)
]
{
cat("Inv4m inversion: Chr4:", inv4m_start, "-", inv4m_end,
"(", length(inv4m_gene_ids), "genes )\n")
cat("Shared introgression: Chr4:", introgression_start, "-", introgression_end,
"(", length(shared_introgression_gene_ids), "genes )\n")
cat("Introgressed flanking:", length(flanking_introgression_gene_ids), "genes\n")
}
## Inv4m inversion: Chr4: 172883675 - 188132113 ( 434 genes )
## Shared introgression: Chr4: 157012149 - 195900523 ( 1099 genes )
## Introgressed flanking: 665 genes
The goal of this section is to classify differentially expressed genes (DEGs) into three tiers of stringency based on significance (FDR) and effect size (\(\log_2\text{FC}\)), as well as a final set of genes selected for deep-dive analysis.
The effect sizes are interpreted differently based on the predictor:
Leaf Position (leaf predictor): This was modeled as a continuous slope (\(\beta\)). To be considered a large effect, the total change across the 3 units (Leaf 1 \(\rightarrow\) Leaf 4) must meet the \(|\log_2\text{FC}| > \frac{3}{2}\) criterion. \[\\ |\text{Total Change}| = |3\beta| > \frac{3}{2} \implies |\beta| > \frac{1}{2} = 0.5\]
We use \(|\beta| > 0.5\) as the threshold for the leaf slope an interacition with -P.
Categorical Predictors (-P, Inv4m, Interaction): The standard large effect size threshold of \(|\log_2\text{FC}| > 1.5\) is applied directly.
Significant DEGs (is_DEG): All genes with FDR (adjusted P-value) \(< 0.05\).
High-confidence DEGs (is_hiconf_DEG): Significant DEGs that also meet the large effect size thresholds defined above.
Selected DEGs (is_selected_DEG): The union of the top 20 genes ranked by P-value and the top 20 high-confidence genes ranked by the Mahalanobis distance to the non-significant centroid near the origin of the \((log_2(FC) \times-log_{10}FDR)\) plane.
This first tier identifies all genes that pass the statistical significance threshold, regardless of effect size. We define is_DEG as any gene where the adjusted P-value (is_significant) is less than 0.05.
{
cat("\nSignificant DEGs (is_DEG, FDR < 0.05) Count:\n")
print(with(effects_df, table(predictor, is_DEG)))
}
##
## Significant DEGs (is_DEG, FDR < 0.05) Count:
## is_DEG
## predictor FALSE TRUE
## Leaf 16361 7888
## -P 18078 6171
## Leaf:-P 21679 2570
## Inv4m 23471 778
## Inv4m:-P 24247 2
## Leaf:Inv4m 24248 1
The second tier filters the significant DEGs further by applying the custom large effect size thresholds. This step defines is_hiconf_DEG as only those genes that are significant AND meet the \(\log_2\text{FC}\) threshold specific to their predictor type (5 for leaf slope, 1.5 for categorical terms).
is_large_effect <- rep(FALSE, nrow(effects_df))
is_leaf <- effects_df$predictor == "Leaf"
is_interaction <- effects_df$predictor == "Leaf:-P" | effects_df$predictor == "Inv4m:-P" | effects_df$predictor == "Leaf:Inv4m"
is_large_effect[is_leaf & abs(effects_df$logFC) > 0.5] <- TRUE
is_large_effect[is_interaction & abs(effects_df$logFC) > 0.5] <- TRUE
is_large_effect[!is_leaf & abs(effects_df$logFC) > 1.5] <- TRUE
effects_df <- effects_df %>%
mutate(is_hiconf_DEG = is_DEG & is_large_effect)
{
cat("\nHigh-Confidence DEGs (is_hiconf_DEG) Count:\n")
print(with(effects_df, table(predictor, is_hiconf_DEG)))
print(with(effects_df %>% filter(is_hiconf_DEG),
table(predictor, regulation,is_hiconf_DEG)) %>%
as_tibble() %>%
arrange(desc(predictor),desc(regulation))
)
}
##
## High-Confidence DEGs (is_hiconf_DEG) Count:
## is_hiconf_DEG
## predictor FALSE TRUE
## Leaf 22818 1431
## -P 23455 794
## Leaf:-P 23684 565
## Inv4m 24084 165
## Inv4m:-P 24247 2
## Leaf:Inv4m 24248 1
## # A tibble: 12 × 4
## predictor regulation is_hiconf_DEG n
## <chr> <chr> <chr> <int>
## 1 Leaf:Inv4m Upregulated TRUE 0
## 2 Leaf:Inv4m Downregulated TRUE 1
## 3 Leaf:-P Upregulated TRUE 367
## 4 Leaf:-P Downregulated TRUE 198
## 5 Leaf Upregulated TRUE 682
## 6 Leaf Downregulated TRUE 749
## 7 Inv4m:-P Upregulated TRUE 0
## 8 Inv4m:-P Downregulated TRUE 2
## 9 Inv4m Upregulated TRUE 79
## 10 Inv4m Downregulated TRUE 86
## 11 -P Upregulated TRUE 582
## 12 -P Downregulated TRUE 212
The final tier, is_selected_DEG, selects the most interesting genes for visualization and detailed annotation. A gene is selected if it is among the top N by P-value (among all DEGs) OR among the top N by Mahalanobis distance (among high-confidence DEGs). This step requires calculating intra-group rankings first.
rank_threshold <- 10
effects_df <- effects_df %>%
group_by(is_hiconf_DEG, predictor, regulation) %>%
mutate(
pval_rank = row_number(dplyr::desc(neglogP)),
mahal_rank = row_number(dplyr::desc(mahalanobis))
) %>%
ungroup() %>%
mutate(
is_selected_DEG = (pval_rank <= rank_threshold & is_hiconf_DEG) |
(mahal_rank <= rank_threshold & is_hiconf_DEG)
)
{
cat(sprintf("\nSelected DEGs (is_selected_DEG, Top N=%d by Rank) Count:\n",
rank_threshold))
with(
effects_df %>% filter(is_selected_DEG & regulation != "Unregulated"),
table(regulation, predictor, is_selected_DEG)
)
}
##
## Selected DEGs (is_selected_DEG, Top N=10 by Rank) Count:
## , , is_selected_DEG = TRUE
##
## predictor
## regulation Leaf -P Leaf:-P Inv4m Inv4m:-P Leaf:Inv4m
## Downregulated 19 17 13 10 2 1
## Upregulated 19 17 14 14 0 0
# Join gene annotations (locus_name, desc come from gene_pannzer)
effects_df <- effects_df %>%
# 1. Join with gene_pannzer
left_join(
gene_pannzer,
by = c(gene = "gene_model"),
relationship = "many-to-many"
) %>%
# 2. Merge locus_name and desc (now that they exist)
mutate(desc_merged = coalesce(locus_name, desc)) %>%
# 3. Reorder columns for readability (using 'gene' as the key column)
select(predictor, regulation, gene, locus_symbol, desc_merged, everything()) %>%
# 4. Add genomic coordinates
inner_join(
genes %>%
select(gene = ID, CHR = seqnames, BP = start) %>%
mutate(CHR = as.character(CHR) %>% as.integer()),
by = "gene"
)
# 5. Make locus_symbol the default locus_label
# Remove symbols corresponding to DNA markers in the consensusmap
effects_df <- effects_df %>%
mutate(
locus_label = case_when(
is.na(locus_symbol) ~ NA_character_,
# Exclude markers/clones (BACs, probes, etc.)
grepl("^si\\d*[a-h]", locus_symbol) ~ NA_character_,
grepl("^umc", locus_symbol) ~ NA_character_, # SSR
grepl("^csu", locus_symbol) ~ NA_character_, # Probe EST
grepl("^bnlg", locus_symbol) ~ NA_character_, # SSR
grepl("^php\\d\\d", locus_symbol) ~ NA_character_, # Probe RFLP
grepl("^csu\\d+\a", locus_symbol) ~ NA_character_, # Probe Overgo
grepl("^pco", locus_symbol) ~ NA_character_, # Probe Overgo
grepl("^IDP", locus_symbol) ~ NA_character_, # Probe INDEL
grepl("^TIDP\\d{4}", locus_symbol) ~ NA_character_, # Probe
grepl("^cl\\d*_\\d", locus_symbol) ~ NA_character_, # Probe Overgo
grepl("^cl\\d*_-", locus_symbol) ~ NA_character_, # Probe Overgo
grepl("^Zm00001eb", locus_symbol) ~ NA_character_, # Assembly id
grepl("^Zm00001d", locus_symbol) ~ NA_character_, # Assembly id
grepl("^GRM", locus_symbol) ~ NA_character_, # Assembly id
grepl("LOC\\d{4}", locus_symbol) ~ NA_character_, # Assembly id
# Default: if it passed all the exclusions, keep the original symbol
TRUE ~ locus_symbol
)
)
{
cat("\nAnnotations added:\n")
cat(" Final columns:", ncol(effects_df), "\n")
cat(" Genes with coordinates:\n")
with(effects_df,
table(predictor,!is.na(effects_df$CHR))
)
}
##
## Annotations added:
## Final columns: 27
## Genes with coordinates:
##
## predictor TRUE
## Leaf 24011
## -P 24011
## Leaf:-P 24011
## Inv4m 24011
## Inv4m:-P 24011
## Leaf:Inv4m 24011
curated <- read.csv("~/Desktop/selected_DEGs_leaf_interaction_model_curated.csv") %>%
select(gene, locus_label,desc_merged) %>%
group_by(gene) %>%
dplyr::slice(1) %>% # Keep first curated annotation per gene
ungroup()
# Coalesce curated locus_label into effects_df
effects_df <- effects_df %>%
left_join(curated, by = "gene", suffix = c("", "_curated")) %>%
mutate(desc_merged = ifelse(desc_merged =="",NA,desc_merged)) %>%
mutate(locus_label = coalesce(locus_label_curated, locus_label),
desc_merged = coalesce(desc_merged_curated, desc_merged)) %>%
select(-locus_label_curated, -desc_merged_curated)
Classify genes by genomic location relative to Inv4m.
#' Count unique genes in a specified genomic region
#'
#' @param effects_df Data frame containing gene annotations and region indicators
#' @param region Character string specifying the region column name
#' (e.g., "in_Inv4m", "in_cis", "in_flank", "in_trans")
#'
#' @return Integer count of unique genes in the specified region
count_genes_in_region <- function(effects_df, region) {
stopifnot(
is.data.frame(effects_df),
is.character(region),
region %in% names(effects_df),
"gene" %in% names(effects_df)
)
effects_df %>%
filter(.data[[region]]) %>%
distinct(gene) %>%
nrow()
}
effects_df <- effects_df %>%
mutate(
in_Inv4m = gene %in% inv4m_gene_ids,
in_cis = gene %in% shared_introgression_gene_ids,
in_flank = gene %in% flanking_introgression_gene_ids,
in_trans = !in_cis
)
regions <- c("in_Inv4m", "in_cis", "in_flank", "in_trans")
{
cat("\nRegion classification:\n")
sapply(regions, function(r) count_genes_in_region(effects_df , r))
}
##
## Region classification:
## in_Inv4m in_cis in_flank in_trans
## 253 608 355 23403
Test whether DEGs are enriched in specific genomic regions using Fisher’s exact test.
# Helper function for Fisher's exact test
run_fisher <- function(data, region_col, deg_col) {
contingency <- table(data[[region_col]], data[[deg_col]])
fisher_result <- fisher.test(contingency)
tibble(
in_region_DEG = contingency[2, 2],
in_region_total = sum(contingency[2, ]),
outside_DEG = contingency[1, 2],
outside_total = sum(contingency[1, ]),
odds_ratio = fisher_result$estimate,
p_value = fisher_result$p.value,
enrichment = (contingency[2, 2] / sum(contingency[2, ])) /
(contingency[1, 2] / sum(contingency[1, ]))
)
}
# Create separate data frame for Inv4m predictor
Region_effects <- effects_df %>%
filter(predictor == "Inv4m") %>%
mutate(outside = in_trans)
# All DEG enrichment tests
deg_tests <- bind_rows(
run_fisher(Region_effects, "in_cis", "is_DEG") %>%
mutate(comparison = "Shared introgression vs Outside"),
run_fisher(
Region_effects %>% filter(in_flank | outside),
"in_flank",
"is_DEG"
) %>%
mutate(comparison = "Flanking vs Outside"),
run_fisher(
Region_effects %>% filter(in_Inv4m | outside),
"in_Inv4m",
"is_DEG"
) %>%
mutate(comparison = "Inv4m vs Outside"),
run_fisher(
Region_effects %>% filter(in_cis),
"in_Inv4m",
"is_DEG"
) %>%
mutate(comparison = "Inv4m vs Flanking")
) %>%
select(comparison, everything())
{
cat("DEG Enrichment (FDR < 0.05):\n")
deg_tests
}
## DEG Enrichment (FDR < 0.05):
# High-confidence DEG enrichment tests
hiconf_deg_tests <- bind_rows(
run_fisher(Region_effects, "in_cis", "is_hiconf_DEG") %>%
mutate(comparison = "Shared introgression vs Outside"),
run_fisher(
Region_effects %>% filter(in_flank | outside),
"in_flank",
"is_hiconf_DEG"
) %>%
mutate(comparison = "Flanking vs Outside"),
run_fisher(
Region_effects %>% filter(in_Inv4m | outside),
"in_Inv4m",
"is_hiconf_DEG"
) %>%
mutate(comparison = "Inv4m vs Outside"),
run_fisher(
Region_effects %>% filter(in_cis),
"in_Inv4m",
"is_hiconf_DEG"
) %>%
mutate(comparison = "Inv4m vs Flanking")
) %>%
select(comparison, everything())
{
cat("\nHigh-confidence DEG Enrichment (FDR < 0.05, |logFC| > 1.5):\n")
hiconf_deg_tests
}
##
## High-confidence DEG Enrichment (FDR < 0.05, |logFC| > 1.5):
{
cat("\n=== Key Findings ===\n")
cat("1. DEGs are enriched in Inv4m region vs genome-wide\n")
cat("2. Both Inv4m and flanking show enrichment vs outside\n")
cat("3. Regional enrichment pattern observed for high-confidence DEGs\n")
cat("\nInterpretation: The introgression region (inversion + flanking)\n")
cat("shows elevated differential expression patterns.\n")
}
##
## === Key Findings ===
## 1. DEGs are enriched in Inv4m region vs genome-wide
## 2. Both Inv4m and flanking show enrichment vs outside
## 3. Regional enrichment pattern observed for high-confidence DEGs
##
## Interpretation: The introgression region (inversion + flanking)
## shows elevated differential expression patterns.
Showing top differentially expressed genes by adjusted p-value.
top_degs_qc <- effects_df %>%
filter(is_DEG, regulation != "Unregulated") %>%
group_by(predictor, regulation) %>%
arrange(-neglogP, .by_group = TRUE) %>%
dplyr::slice(1:10) %>%
select(
predictor, gene, locus_symbol,
desc_merged, logFC, neglogP
) %>%
arrange(predictor, regulation,-neglogP)
{
cat("\nTop 10 DEGs per predictor and regulation:\n")
top_degs_qc
}
##
## Top 10 DEGs per predictor and regulation:
Most extreme differentially expressed genes.
top_outliers_qc <- effects_df %>%
filter(is_mh_outlier) %>%
group_by(predictor, regulation) %>%
arrange(-mahalanobis, .by_group = TRUE) %>%
dplyr::slice(1:10) %>%
select(
predictor, regulation, gene,
locus_symbol, desc_merged,
logFC, neglogP, mahalanobis
) %>%
arrange(-neglogP)
{
cat("\nTop Mahalanobis outliers per predictor and regulation:\n")
top_outliers_qc
}
##
## Top Mahalanobis outliers per predictor and regulation:
# Overall DEG counts by predictor
overall_summary <- effects_df %>%
group_by(predictor) %>%
summarise(
total_genes = n(),
n_significant = sum(is_DEG),
n_DEG = sum(is_DEG),
n_hiconf_DEG = sum(is_hiconf_DEG),
n_selected_DEG = sum(is_selected_DEG),
pct_DEG = round(100 * n_DEG / total_genes, 2)
)
# Region distribution for Inv4m effect
inv4m_region_summary <- effects_df %>%
filter(predictor == "Inv4m", is_DEG) %>%
group_by(regulation, in_Inv4m, in_cis) %>%
summarise(n = n(), .groups = "drop")
{
cat("\n=== DEG Summary Statistics ===\n")
overall_summary
cat("\nInv4m DEGs by region and regulation:\n")
inv4m_region_summary
}
##
## === DEG Summary Statistics ===
##
## Inv4m DEGs by region and regulation:
Extract selected DEGs for detailed presentation in manuscript tables.
selected_degs <- effects_df %>%
filter(is_selected_DEG) %>%
select(
predictor,
regulation,
gene,
locus_symbol,
locus_label,
desc_merged,
std_err,
logFC,
neglogP,
mahalanobis,
pval_rank,
mahal_rank
) %>%
arrange(predictor, regulation, -neglogP)
{
cat("\n=== Selected DEGs for Manuscript ===\n")
cat("Total selected DEGs:", nrow(selected_degs), "\n\n")
cat("Counts by predictor and regulation:\n")
print(with(selected_degs, table(predictor, regulation)))
}
##
## === Selected DEGs for Manuscript ===
## Total selected DEGs: 122
##
## Counts by predictor and regulation:
## regulation
## predictor Downregulated Upregulated
## Leaf 18 19
## -P 17 17
## Leaf:-P 13 11
## Inv4m 10 14
## Inv4m:-P 2 0
## Leaf:Inv4m 1 0
Export selected DEGs specific to the phosphorus effect with pannzer description.
p_selected <- selected_degs %>%
mutate(
regulation = factor(
regulation,
levels = c("Upregulated", "Downregulated")
)
) %>%
filter(predictor == "-P" ) %>%
arrange(regulation, -neglogP)
{
cat("\nPhosphorus effect selected DEGs:\n")
cat(" Upregulated:", sum(p_selected$regulation == "Upregulated"), "\n")
cat(" Downregulated:", sum(p_selected$regulation == "Downregulated"), "\n")
p_selected
}
##
## Phosphorus effect selected DEGs:
## Upregulated: 17
## Downregulated: 17
leaf_selected <- selected_degs %>%
mutate(
regulation = factor(
regulation,
levels = c("Upregulated", "Downregulated")
)
) %>%
filter(predictor == "Leaf") %>%
arrange(regulation, -neglogP)
{
cat("\nLeaf stage effect selected DEGs:\n")
cat(" Upregulated:", sum(leaf_selected$regulation == "Upregulated"), "\n")
cat(" Downregulated:", sum(leaf_selected$regulation == "Downregulated"), "\n")
leaf_selected
}
##
## Leaf stage effect selected DEGs:
## Upregulated: 19
## Downregulated: 18
Export selected DEGs specific to the leaf:treatment interaction.
interaction_selected <- selected_degs %>%
mutate(
regulation = factor(
regulation,
levels = c("Upregulated", "Downregulated")
)
) %>%
group_by(regulation) %>%
filter(predictor == "Leaf:-P") %>%
arrange(regulation, -neglogP)
{
cat("\nLeaf:Treatment interaction selected DEGs:\n")
cat(" Upregulated:", sum(interaction_selected$regulation == "Upregulated"), "\n")
cat(" Downregulated:",
sum(interaction_selected$regulation == "Downregulated"), "\n")
interaction_selected
}
##
## Leaf:Treatment interaction selected DEGs:
## Upregulated: 11
## Downregulated: 13
# Load normalized expression data
voomR <- readRDS("~/Desktop/normalized_expression_voom_object_leaf_trt.rda")
# Define genes of interest - High_confidence interaction genes
interaction_highconf <- effects_df %>%
filter(is_hiconf_DEG & predictor == "Leaf:-P")
interaction_genes <- interaction_highconf %>%
pull(gene) %>%
unique()
# Extract expression matrix for genes of interest
expr_matrix <- voomR$E[interaction_genes, ]
# Convert to long format and join with sample metadata
expr_long <- as.data.frame(t(expr_matrix)) %>%
tibble::rownames_to_column("sample") %>%
pivot_longer(
cols = -sample,
names_to = "gene",
values_to = "log10CPM"
) %>%
left_join(
voomR$targets %>%
tibble::rownames_to_column("sample") %>%
select(sample, Treatment, leaf_tissue, Genotype),
by = "sample"
) %>%
left_join(
interaction_highconf %>%
filter(predictor == "Leaf:-P") %>%
select(gene, regulation) %>%
distinct(),
by = "gene"
)
# Recode treatment levels
expr_long$leafxP <- factor(expr_long$regulation)
levels(expr_long$leafxP) <- c("Negative", "Positive")
# expr_long$leafxP <- factor(expr_long$leafxP, c("Positive","Negative"))
# Center expression per gene-treatment combination
expr_long <- expr_long %>%
group_by(gene, Treatment) %>%
mutate(centered_log10CPM = log10CPM - mean(log10CPM)) %>%
ungroup()
# Calculate mean centered expression per gene per leaf stage per treatment
expr_summary <- expr_long %>%
group_by(gene, Treatment, leaf_tissue, leafxP) %>%
summarise(mean_log10CPM = mean(centered_log10CPM), .groups = "drop")
# Create separate datasets for each treatment
expr_minusP <- expr_summary %>% filter(Treatment == "-P")
expr_plusP <- expr_summary %>% filter(Treatment == "+P")
# Create spaghetti plot
base_size <- 30
# Create base plot for -P treatment (no shadow)
p_minusP <- expr_minusP %>%
ggplot(aes(x = leaf_tissue, y = mean_log10CPM, color = Treatment)) +
geom_line(
aes(group = interaction(gene, Treatment)),
alpha = 0.1,
linewidth = 0.8
) +
geom_smooth(
aes(group = Treatment),
method = "lm",
formula = y ~ x,
se = FALSE,
linewidth = 2
) %>%
with_shadow(colour = "black", x_offset = 0, y_offset = 0, sigma = 2) +
facet_wrap(~ leafxP, ncol = 2) +
labs(
x = "Leaf",
y = expression("Centered " * log[10] * "(CPM)")
) +
theme_classic(base_size = base_size) +
theme(
strip.background = element_blank(),
strip.text.x = element_blank(),
plot.title = element_blank(),
axis.title.y = element_text(margin = margin(r = -10)),
plot.margin = margin(5.5, 7, 50, 5.5, "pt")
)
# Overlay the +P plots with shadows
p_spaghetti <- p_minusP +
geom_line(
data = expr_plusP,
aes(x = leaf_tissue, y = mean_log10CPM, color = Treatment, group = interaction(gene, Treatment)),
alpha = 0.1,
linewidth = 0.8
) %>%
with_shadow(colour = "black", x_offset = 0, y_offset = 0, sigma = 2) +
geom_smooth(
data = expr_plusP,
aes(x = leaf_tissue, y = mean_log10CPM, color = Treatment, group = Treatment),
method = "lm",
formula = y ~ x,
se = FALSE,
linewidth = 2
) %>%
with_shadow(colour = "black", x_offset = 0, y_offset = 0, sigma = 2) +
scale_color_manual(
values = c("+P" = "gold", "-P" = "purple4"),
name = "Treatment"
) +
guides(color = guide_legend(
override.aes = list(linewidth = 2, alpha = 1)
)) +
theme(
legend.position = c(0.4, 0.95),
legend.title = element_blank(),
legend.background = element_rect(fill = "transparent"),
legend.box.background = element_rect(fill = "transparent", color = NA)
)
p_spaghetti
# Select top annotated genes per regulation direction
interaction_for_profiles <- interaction_highconf %>%
mutate(regulation = factor(
regulation,
levels= c("Downregulated","Upregulated")
)) %>%
filter( predictor == "Leaf:-P") %>%
group_by(regulation) %>%
arrange(regulation,-neglogP) %>%
dplyr::slice(1:12) %>%
ungroup() %>%
mutate(locus_label = paste0(
locus_label, " ", gene, "\n",
str_wrap(coalesce(locus_name, desc_merged), width = 30,
whitespace_only = FALSE)
)) %>%
mutate(locus_label = gsub("pco139502b |NA ", "", locus_label))
interaction_for_profiles$locus_label <- gsub("^ +","",interaction_for_profiles$locus_label)
interaction_for_profiles$gene
## [1] "Zm00001eb359280" "Zm00001eb207130" "Zm00001eb389720" "Zm00001eb070520"
## [5] "Zm00001eb212520" "Zm00001eb179680" "Zm00001eb111630" "Zm00001eb362560"
## [9] "Zm00001eb214780" "Zm00001eb071770" "Zm00001eb004410" "Zm00001eb325410"
## [13] "Zm00001eb157810" "Zm00001eb376160" "Zm00001eb063230" "Zm00001eb144680"
## [17] "Zm00001eb339870" "Zm00001eb393060" "Zm00001eb148030" "Zm00001eb009430"
## [21] "Zm00001eb011050" "Zm00001eb289800" "Zm00001eb297970" "Zm00001eb040890"
# Filter and annotate existing expr_long
expr_plot <- expr_long %>%
inner_join(
interaction_for_profiles
) %>% group_by(regulation) %>%
mutate(locus_label = forcats::fct_reorder(locus_label,-neglogP))
# Define gene display order (top 2 per direction)
gene_order <- interaction_for_profiles %>%
group_by(regulation) %>%
arrange(regulation,-neglogP) %>%
dplyr::slice(1:2) %>%
pull(gene)
gene_order
## [1] "Zm00001eb359280" "Zm00001eb207130" "Zm00001eb157810" "Zm00001eb376160"
# Reaction norm plot
p_gene <- expr_plot %>%
filter(gene %in% gene_order) %>%
group_by(regulation) %>%
arrange(regulation,-neglogP) %>%
ungroup() %>%
ggplot(aes(x = leaf_tissue, y = log10CPM,
color = Treatment, group = Treatment)) +
stat_summary(fun = mean, geom = "line", linewidth = 1) +
stat_summary(fun = mean, geom = "point", size = 5) +
stat_summary(fun.data = mean_se, geom = "errorbar",
width = 0.1, linewidth = 1.5) +
scale_color_manual(
values = c("+P" = "gold", "-P" = "purple4"),
name = "Treatment"
) +
guides(color = guide_legend(reverse = TRUE)) +
facet_wrap(~ locus_label, scales = "free_y", ncol = 2, dir = "v") +
labs(
x = "Leaf",
y = expression(log[10]*"(CPM)")
) +
theme_classic(base_size = base_size) +
theme(
strip.background = element_blank(),
strip.text = element_text(size = 15, face = "bold", hjust = 0),
plot.margin = margin(-53, 5.5, 5.5, 5.5, "pt"),
legend.background = element_rect(fill = "transparent"),
legend.box.background = element_rect(fill = "transparent", color = NA),
legend.position = c(0.075, 0.65),
legend.title = element_blank()
)
p_gene
plot_genes <- interaction_for_profiles%>%
group_by(regulation) %>%
select(predictor:gene,locus_label,desc_merged, neglogP) %>%
distinct() %>%
arrange(regulation, -neglogP) %>%
dplyr::slice(1:10) %>%
pull(gene)
p_interaction <- expr_plot %>%
filter(gene %in% plot_genes) %>%
group_by(regulation) %>%
arrange(regulation, -neglogP) %>%
ggplot(aes(x = leaf_tissue, y = log10CPM,
color = Treatment, group = Treatment)) +
stat_summary(fun = mean, geom = "line", linewidth = 1) +
stat_summary(fun = mean, geom = "point", size = 5) +
stat_summary(fun.data = mean_se, geom = "errorbar",
width = 0.1, linewidth = 1.5) +
scale_color_manual(
values = c("+P" = "gold", "-P" = "purple4"),
name = "Treatment"
) +
guides(color = guide_legend(reverse = TRUE)) +
facet_wrap(~ locus_label, scales = "free_y",
ncol = 5, nrow=4,) +
labs(
x = "Leaf",
y = expression(log[10]*"(CPM)")
) +
theme_classic(base_size = base_size) +
theme(
strip.background = element_blank(),
strip.text = element_text(size = 15, face = "bold", hjust = 0),
legend.background = element_rect(fill = "transparent"),
legend.box.background = element_rect(fill = "transparent", color = NA),
legend.position = "top",
legend.title = element_blank()
)
p_interaction
# Combine plots
right_panel <- cowplot::plot_grid(
p_spaghetti, p_gene,
ncol = 1,
align = 'h',
axis = 'lr'
)
ggsave(
"~/Desktop/right_panel.png",
plot = right_panel,
width = 8,
height = 15,
dpi = 150
)
right_panel
{
cat("Spaghetti plot created with shadows on gold lines only\n")
cat("Total unique genes plotted:", length(plot_genes), "\n")
}
## Spaghetti plot created with shadows on gold lines only
## Total unique genes plotted: 20
# Full effects table with all columns
write.csv(
effects_df,
file = "~/Desktop/predicto r_effects_leaf_interaction_model.csv",
row.names = FALSE
)
{
cat("\nFull effects table exported:\n")
cat(" predictor_effects_leaf_interaction_model.csv\n")
}
# Selected DEGs for manuscript
write.csv(
selected_degs,
file = "~/Desktop/selected_DEGs_leaf_interaction_model.csv",
row.names = FALSE
)
# Phosphorus selected DEGs
write.csv(
p_selected,
file = "~/Desktop/p_selected_DEGs_leaf_interaction_model.csv",
row.names = FALSE
)
write.csv(
p_selected,
file = "~/Desktop/leaf_selected_DEGs_leaf_interaction_model.csv",
row.names = FALSE
)
# Inv4m selected DEGs
write.csv(
selected_degs %>% filter(predictor == "Inv4m"),
file = "~/Desktop/inv4m_selected_DEGs_leaf_interaction_model.csv",
row.names = FALSE
)
# Leaf:Treatment interaction selected DEGs
write.csv(
interaction_selected,
file = "~/Desktop/leafxP_selected_DEGs_leaf_interaction_model.csv",
row.names = FALSE
)
{
cat("\nSelected DEG tables exported:\n")
cat(" selected_DEGs_leaf_interaction_model.csv - All selected DEGs\n")
cat(" p_selected_DEGs_leaf_interaction_model.csv - Phosphorus effect\n")
cat(" p_selected_DEGs_leaf_interaction_model.csv - Leaf effect\n")
cat(" inv4m_selected_DEGs_leaf_interaction_model.csv - Inv4m effect\n")
cat(" leafxP_selected_DEGs_leaf_interaction_model.csv - Leaf:Treatment interaction\n")
}
This analysis identified differentially expressed genes across four main effects:
Inv4m genotype: Genes with different expression in Inv4m vs Control (|logFC| > 1.5, FDR < 0.05)
Leaf position gradient: Genes showing expression changes along the apical-basal axis (|logFC| > 0.5, FDR < 0.05)
Phosphorus deficiency: Genes responding to low P treatment (|logFC| > 1.5, FDR < 0.05)
Leaf × Treatment interaction: Genes where the leaf gradient differs between P treatments (|logFC| > 1.5, FDR < 0.05)
MDS analysis revealed that:
Spatial covariates (Plot_Row, Plot_Column) were included to account for field position effects
Region enrichment: DEGs show enrichment in the Inv4m region, with patterns observed across both inversion and flanking regions
Three-tier classification:
All results include:
The full effects table contains 144066 gene × predictor combinations.
sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sequoia 15.6.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] robustbase_0.99-6 ggtext_0.1.2 ggfx_1.0.3
## [4] ggpubr_0.6.2 ggplot2_4.0.0 stringr_1.5.2
## [7] tidyr_1.3.1 dplyr_1.1.4 rtracklayer_1.69.1
## [10] GenomicRanges_1.61.5 Seqinfo_0.99.2 IRanges_2.43.5
## [13] S4Vectors_0.47.4 BiocGenerics_0.55.4 generics_0.1.4
## [16] edgeR_4.7.6 limma_3.65.7
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-9 rlang_1.1.6
## [3] magrittr_2.0.4 matrixStats_1.5.0
## [5] compiler_4.5.1 mgcv_1.9-3
## [7] systemfonts_1.3.1 vctrs_0.6.5
## [9] pkgconfig_2.0.3 crayon_1.5.3
## [11] fastmap_1.2.0 backports_1.5.0
## [13] magick_2.9.0 XVector_0.49.1
## [15] labeling_0.4.3 utf8_1.2.6
## [17] Rsamtools_2.25.3 rmarkdown_2.30
## [19] markdown_2.0 ragg_1.5.0
## [21] purrr_1.1.0 xfun_0.53
## [23] cachem_1.1.0 litedown_0.7
## [25] jsonlite_2.0.0 DelayedArray_0.35.3
## [27] BiocParallel_1.43.4 broom_1.0.10
## [29] parallel_4.5.1 R6_2.6.1
## [31] bslib_0.9.0 stringi_1.8.7
## [33] RColorBrewer_1.1-3 car_3.1-3
## [35] jquerylib_0.1.4 Rcpp_1.1.0
## [37] SummarizedExperiment_1.39.2 knitr_1.50
## [39] Matrix_1.7-4 splines_4.5.1
## [41] tidyselect_1.2.1 rstudioapi_0.17.1
## [43] abind_1.4-8 yaml_2.3.10
## [45] codetools_0.2-20 curl_7.0.0
## [47] lattice_0.22-7 tibble_3.3.0
## [49] Biobase_2.69.1 withr_3.0.2
## [51] S7_0.2.0 evaluate_1.0.5
## [53] xml2_1.4.0 Biostrings_2.77.2
## [55] pillar_1.11.1 MatrixGenerics_1.21.0
## [57] carData_3.0-5 RCurl_1.98-1.17
## [59] commonmark_2.0.0 scales_1.4.0
## [61] glue_1.8.0 tools_4.5.1
## [63] BiocIO_1.19.0 locfit_1.5-9.12
## [65] ggsignif_0.6.4 GenomicAlignments_1.45.4
## [67] forcats_1.0.1 XML_3.99-0.19
## [69] cowplot_1.2.0 grid_4.5.1
## [71] nlme_3.1-168 restfulr_0.0.16
## [73] Formula_1.2-5 cli_3.6.5
## [75] textshaping_1.0.4 S4Arrays_1.9.1
## [77] viridisLite_0.4.2 gtable_0.3.6
## [79] DEoptimR_1.1-4 rstatix_0.7.3
## [81] sass_0.4.10 digest_0.6.37
## [83] SparseArray_1.9.1 rjson_0.2.23
## [85] farver_2.1.2 htmltools_0.5.8.1
## [87] lifecycle_1.0.4 httr_1.4.7
## [89] statmod_1.5.1 gridtext_0.1.5