# extract normalization factors from dge object (created in setup)
norm_factors <- data.frame(
  sample = colnames(dge),
  lib_size = dge$samples$lib.size,
  norm_factor = dge$samples$norm.factors,
  effective_lib = dge$samples$lib.size * dge$samples$norm.factors
) %>%
  left_join(sample_meta, by = "sample")

write_tsv(norm_factors, here("06-comp-bias", "outputs", "tmm_norm_factors.tsv"))

p <- ggplot(norm_factors, aes(x = timepoint_label, y = norm_factor)) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "gray50") +
  geom_point(size = 3, alpha = 0.7) +
  stat_summary(fun = mean, geom = "line", group = 1, color = "steelblue", linewidth = 0.8) +
  labs(
    title = "TMM Normalization Factors - K. lactis",
    subtitle = "Values < 1 at 30min suggest compositional shift; line = mean",
    x = "Timepoint",
    y = "Normalization Factor"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggsave(here("06-comp-bias", "figures", "01_tmm_norm_factors.pdf"), p, width = 8, height = 5)
ggsave(here("06-comp-bias", "figures", "01_tmm_norm_factors.png"), p, width = 8, height = 5, dpi = 150)

p

norm_factors %>%
  group_by(timepoint_label) %>%
  summarise(
    mean_nf = mean(norm_factor),
    sd_nf = sd(norm_factor),
    .groups = "drop"
  ) %>%
  print()
## # A tibble: 13 × 3
##    timepoint_label mean_nf     sd_nf
##    <fct>             <dbl>     <dbl>
##  1 0min              0.904  0.0139  
##  2 15min             0.934  0.0113  
##  3 30min             0.836  0.0143  
##  4 45min             0.923  0.0204  
##  5 60min             0.935  0.00368 
##  6 90min             0.991  0.000913
##  7 120min            1.03  NA       
##  8 150min            1.07   0.0104  
##  9 180min            1.07   0.00372 
## 10 210min            1.07   0.0164  
## 11 240min            1.09   0.00484 
## 12 360min            1.11   0.00168 
## 13 480min            1.10   0.00789

Diagnostic 1: TMM Normalization Factors

Rationale: TMM normalization assumes most genes are not differentially expressed. When a subset of genes is massively upregulated, they consume a larger fraction of sequencing reads, inflating those libraries. The raw counts of non-induced genes are artificially depressed by the compositional shift (dilution). TMM assigns NF < 1.0 to correct this by inflating normalized expression values.

Interpretation: - TMM selected Klac.t0480.b3 (480min) as the reference—the sample whose upper quartile is closest to the mean upper quartile across all samples - NF < 1.0: library has more reads concentrated in high-abundance genes relative to reference - NF > 1.0: library has less concentration in dominant genes

The 30min samples (NF ≈ 0.83–0.85) are scaled down ~15-17% relative to the 480min reference, indicating acute compositional shift at this timepoint.

General Take: The 30min timepoint shows a clear dip (NF ≈ 0.84), indicating compositional shift. This is consistent with rapid, strong induction of phosphate-responsive genes at this timepoint. Factors recover and exceed 1.0 by 150min, suggesting the transcripto

# RLE calculation:
# for each gene, calculates the median expression across ALL samples
# then for each sample, calculate deviation from that median
gene_medians <- apply(log2cpm_uncorrected, 1, median)
rle_matrix <- log2cpm_uncorrected - gene_medians

# converts to long format for plotting
rle_long <- as.data.frame(rle_matrix) %>%
  rownames_to_column("gene_id") %>%
  pivot_longer(-gene_id, names_to = "sample", values_to = "rle") %>%
  left_join(sample_meta, by = "sample")

rle_summary <- rle_long %>%
  group_by(sample, timepoint_min, timepoint_label) %>%
  summarise(
    median_rle = median(rle),
    iqr_rle = IQR(rle),
    q25 = quantile(rle, 0.25),
    q75 = quantile(rle, 0.75),
    .groups = "drop"
  )

write_tsv(rle_summary, here("06-comp-bias", "outputs", "rle_summary.tsv"))

# RLE boxplot (motivated by TMM Biostars forum I found)
p1 <- ggplot(rle_long, aes(x = timepoint_label, y = rle)) +
  geom_boxplot(aes(fill = timepoint_label), outlier.shape = NA, alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "firebrick", linewidth = 1) +
  coord_cartesian(ylim = c(-2, 2)) +
  scale_fill_viridis_d(option = "plasma") +
  labs(
    title = "RLE Plot: Relative Log Expression by Timepoint",
    subtitle = "Boxes shifted below 0 indicate systematic downward bias; centered at 0 = no bias",
    x = "Timepoint",
    y = "Relative Log Expression (log2 deviation from gene median)",
    caption = "Each gene's expression compared to its median across all samples"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  )

ggsave(here("06-comp-bias", "figures", "01B_rle_boxplot.pdf"), p1, width = 10, height = 5)
ggsave(here("06-comp-bias", "figures", "01B_rle_boxplot.png"), p1, width = 10, height = 5, dpi = 150)

# median RLE per timepoint
rle_timepoint_summary <- rle_summary %>%
  group_by(timepoint_min, timepoint_label) %>%
  summarise(
    mean_median_rle = mean(median_rle),
    sd_median_rle = sd(median_rle),
    .groups = "drop"
  )

p2 <- ggplot(rle_timepoint_summary, aes(x = timepoint_label, y = mean_median_rle)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_point(size = 3, color = "steelblue") +
  geom_line(aes(group = 1), color = "steelblue", linewidth = 0.8) +
  geom_errorbar(
    aes(
      ymin = mean_median_rle - sd_median_rle,
      ymax = mean_median_rle + sd_median_rle
    ),
    width = 0.2, color = "steelblue"
  ) +
  labs(
    title = "Median RLE by Timepoint",
    subtitle = "Values below 0 indicate global downward shift (compositional bias or real repression)",
    x = "Timepoint",
    y = "Mean Median RLE (± SD)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggsave(here("06-comp-bias", "figures", "01B_rle_median_trend.pdf"), p2, width = 8, height = 5)
ggsave(here("06-comp-bias", "figures", "01B_rle_median_trend.png"), p2, width = 8, height = 5, dpi = 150)

# per-sample RLE with timepoint coloring
p3 <- ggplot(rle_summary, aes(x = reorder(sample, timepoint_min), y = median_rle, fill = timepoint_label)) +
  geom_col() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "firebrick") +
  scale_fill_viridis_d(option = "plasma") +
  labs(
    title = "Median RLE per Sample",
    subtitle = "Samples with negative median RLE have globally lower expression relative to other samples",
    x = "Sample",
    y = "Median RLE",
    fill = "Timepoint"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, size = 6),
    legend.position = "bottom"
  )

ggsave(here("06-comp-bias", "figures", "01B_rle_per_sample.pdf"), p3, width = 12, height = 5)
ggsave(here("06-comp-bias", "figures", "01B_rle_per_sample.png"), p3, width = 12, height = 5, dpi = 150)

print(rle_timepoint_summary)
## # A tibble: 13 × 4
##    timepoint_min timepoint_label mean_median_rle sd_median_rle
##            <dbl> <fct>                     <dbl>         <dbl>
##  1             0 0min                    -0.164        0.0191 
##  2            15 15min                   -0.0963       0.0204 
##  3            30 30min                   -0.237        0.0237 
##  4            45 45min                   -0.120        0.0339 
##  5            60 60min                   -0.0793       0.0124 
##  6            90 90min                   -0.0220       0.00807
##  7           120 120min                   0           NA      
##  8           150 150min                   0.0199       0.0109 
##  9           180 180min                   0.0272       0.00631
## 10           210 210min                   0.0179       0.0253 
## 11           240 240min                   0.0412       0.00912
## 12           360 360min                   0.0467       0.00765
## 13           480 480min                   0.0180       0.0128
p1

p2

p3

Rationale: RLE (Relative Log Expression) provides a TMM-independent view of compositional bias. For each gene, I calculate its deviation from its own median across all samples. If a sample has systematically lower expression, its RLE distribution shifts below zero.

Interpretation: The 30min samples show median RLE ≈ -0.25, confirming the compositional shift detected by TMM using an orthogonal method. This is not circular with TMM as RLE uses gene-wise medians, while TMM uses trimmed means of M-values. Convergent evidence from independent methods strengthens confidence that the 30min compositional shift is real.

General Take: Both TMM NF (0.84) and RLE median (-0.25) point to the same phenomenon: the 30min library is dominated by fewer high-abundance transcripts.

# calculate library proportion per gene per sample
lib_totals <- colSums(counts_collapsed)
lib_prop <- t(t(counts_collapsed) / lib_totals) * 100

prop_summary <- data.frame(
  gene_id = rownames(lib_prop),
  prop_0min = rowMeans(lib_prop[, samples_0, drop = FALSE]),
  prop_30min = rowMeans(lib_prop[, samples_30, drop = FALSE])
) %>%
  left_join(id_to_name, by = c("gene_id" = "salmon_id")) %>%
  mutate(
    prop_change = prop_30min - prop_0min,
    fold_change = prop_30min / pmax(prop_0min, 0.001),
    gene_label = ifelse(is.na(gene_name) | gene_name == "", gene_id, gene_name)
  ) %>%
  arrange(desc(prop_30min))

top30 <- head(prop_summary, 30)
write_tsv(top30, here("06-comp-bias", "outputs", "top30_genes_by_lib_prop_30min.tsv"))

top_n_cumsum <- data.frame(
  n_genes = 1:100,
  cum_prop_0min = cumsum(prop_summary$prop_0min[1:100]),
  cum_prop_30min = cumsum(prop_summary$prop_30min[1:100])
)
write_tsv(top_n_cumsum, here("06-comp-bias", "outputs", "cumulative_lib_prop.tsv"))

top20_long <- prop_summary %>%
  head(20) %>%
  pivot_longer(
    cols = c(prop_0min, prop_30min),
    names_to = "timepoint", values_to = "lib_percent"
  ) %>%
  mutate(timepoint = gsub("prop_", "", timepoint))

p1 <- ggplot(top20_long, aes(x = reorder(gene_label, -lib_percent), y = lib_percent, fill = timepoint)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("0min" = "gray60", "30min" = "firebrick")) +
  labs(
    title = "Top 20 Genes by Library Proportion at 30min",
    subtitle = "Red bars show 30min; gray bars show baseline (0min)",
    x = NULL,
    y = "% of Library"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggsave(here("06-comp-bias", "figures", "02_top20_lib_proportion.pdf"), p1, width = 10, height = 5)
ggsave(here("06-comp-bias", "figures", "02_top20_lib_proportion.png"), p1, width = 10, height = 5, dpi = 150)

p2 <- ggplot(top_n_cumsum, aes(x = n_genes)) +
  geom_line(aes(y = cum_prop_0min, color = "0min"), linewidth = 1) +
  geom_line(aes(y = cum_prop_30min, color = "30min"), linewidth = 1) +
  scale_color_manual(values = c("0min" = "gray60", "30min" = "firebrick")) +
  labs(
    title = "Cumulative Library Proportion - Top N Genes",
    subtitle = "Steeper rise at 30min indicates more library concentration",
    x = "Number of Top Genes",
    y = "Cumulative % of Library",
    color = "Timepoint"
  ) +
  theme_minimal()

ggsave(here("06-comp-bias", "figures", "02_cumulative_lib_proportion.pdf"), p2, width = 7, height = 5)
ggsave(here("06-comp-bias", "figures", "02_cumulative_lib_proportion.png"), p2, width = 7, height = 5, dpi = 150)

print(head(prop_summary, 10) %>% select(gene_label, prop_0min, prop_30min, prop_change))
##            gene_label prop_0min prop_30min prop_change
## 1                TEF2 1.6592782  2.3922913  0.73301308
## 2                SSB1 0.4154825  0.9053621  0.48987957
## 3                EFT2 0.5306037  0.7797674  0.24916368
## 4  gene-KLLA0_B14498g 1.1165873  0.6530856 -0.46350171
## 5                YEF3 0.2643213  0.6451652  0.38084391
## 6               RPL2A 0.2635703  0.6388559  0.37528557
## 7               CCW12 0.5321844  0.5903763  0.05819188
## 8                RPL5 0.2540112  0.5756700  0.32165888
## 9                RPL3 0.3351466  0.5711398  0.23599321
## 10              RPS4B 0.2461885  0.5523871  0.30619855
p1

p2

## Diagnostic 2 Notes Rationale: If specific genes consume a disproportionate share of the library at 30min, they could drive the compositional shift detected by TMM.I calculated each gene’s percentage of total library reads at 0min vs 30min.

Results: The top 20 genes by library proportion at 30min are dominated by translation elongation factors (TEF2, EFT2, YEF3) and ribosomal proteins (RPL3, RPL5, RPS4B, RPS6B, etc.). PHO89, a high-affinity phosphate transporter, shows a 105-fold change in library proportion (0.004% to 0.46%)

Several other genes show notable changes in library proportion at 30min:

The cumulative proportion analysis shows that reaching 20% of the library takes approximately 50 genes at 0min but only 25 genes at 30min. The library is more dominated by a small number of genes at 30min, meaning reads are distributed less evenly across the transcriptome. This makes me more confident in my initial decision for TMM normalization from the outset.

General Take: Ribosomal genes dominate the top proportion rankings because they are the most abundant transcripts in the cell. Our earlier comparison with Gurvich et al. (2017) confirms that ribosomal protein downregulation is a phosphate starvation response as cells reduce ribosome biogenesis to conserve phosphate. The compositional shift at 30min reflects changes in both PHO pathway genes and ribosomal/translation machinery– at least in S. cerevisiae. The functional themes among genes with large proportion changes (chaperones, one-carbon metabolism, mitochondrial transport) suggest K. lactis is undergoing metabolic and proteostatic remodeling alongside the canonical phosphate response. Whether housekeeping genes like TDH3, ACT1, and ATG8 are genuinely affected or appear downregulated as a normalization artifact remains to be tested later.

Before I test the housekeeping genes and their trends to the perceived compositional bias, I am going to analyze the trends via (1).

\[ \Delta P_g = \frac{\text{Counts}_{g, \text{30min}}}{\text{TotalLib}_{\text{30min}}} - \frac{\text{Counts}_{g, \text{0min}}}{\text{TotalLib}_{\text{0min}}} \tag{1} \] After discussing with Dr. He, I realize that DeltaP is no different than standard DGE analysis. Both measure relative abundance changes within a sum-constrained system.

\(\Delta P\) (this analysis): \[\Delta P_g = \frac{\text{Counts}_{g, \text{30min}}}{\text{TotalLib}_{\text{30min}}} - \frac{\text{Counts}_{g, \text{0min}}}{\text{TotalLib}_{\text{0min}}} = P_{g, \text{30}} - P_{g, \text{0}}\]

Standard log2 fold change:\[\log_2 \text{FC}_g = \log_2 \left( \frac{\text{CPM}_{g, \text{30}}}{\text{CPM}_{g, \text{0}}} \right) = \log_2(P_{g, \text{30}}) - \log_2(P_{g, \text{0}})\]

Since CPM is just proportion \(\times 10^6\), these are mathematically related transformations of the same underlying quantity:\(\Delta P\): additive difference of proportions\(\log_2 \text{FC}\): difference of log-proportions (i.e., log-ratio)Neither escapes the fundamental constraint that library proportions must sum to 100%:\[\sum_g P_{g,t} = 1 \quad \text{for all timepoints } t\]The genes with large positive \(\Delta P\) are the same genes with large positive \(\log_2 \text{FC}\). The genes with large negative \(\Delta P\) are the same genes being called as significantly downregulated in DGE.The value of this analysis isn’t that it’s fundamentally different. but its reframing in terms of “library space” made the compositional constraint explicit in my brain. Just had to redo things to get a firmer understanding. Since I had it already, I kept it.

# filtered for high abundance (mean counts > 500)
mean_counts <- rowMeans(counts_collapsed)
high_abundance_genes <- names(mean_counts[mean_counts > 500])

# calculates delta P as shown in (1)
lib_total_0 <- sum(counts_collapsed[, samples_0])
lib_total_30 <- sum(counts_collapsed[, samples_30])

delta_p <- data.frame(
  gene_id = rownames(counts_collapsed),
  counts_0 = rowSums(counts_collapsed[, samples_0, drop = FALSE]),
  counts_30 = rowSums(counts_collapsed[, samples_30, drop = FALSE])
) %>%
  mutate(
    prop_0 = counts_0 / lib_total_0,
    prop_30 = counts_30 / lib_total_30,
    delta_p = prop_30 - prop_0
  ) %>%
  filter(gene_id %in% high_abundance_genes) %>%
  left_join(id_to_name, by = c("gene_id" = "salmon_id"))

# selects cumulative drivers (80% of total shift)
# positive drivers (i.e., sets inflating library size)
positive_drivers <- delta_p %>%
  filter(delta_p > 0) %>%
  arrange(desc(delta_p)) %>%
  mutate(
    cum_delta = cumsum(delta_p),
    total_positive = sum(delta_p),
    cum_pct = cum_delta / total_positive
  ) %>%
  filter(cum_pct <= 0.80 | row_number() == 1)

# negative drivers (i.e., high abundance transcripts being dumped)
negative_drivers <- delta_p %>%
  filter(delta_p < 0) %>%
  arrange(delta_p) %>%
  mutate(
    cum_delta = cumsum(abs(delta_p)),
    total_negative = sum(abs(delta_p)),
    cum_pct = cum_delta / total_negative
  ) %>%
  filter(cum_pct <= 0.80 | row_number() == 1)


write_tsv(positive_drivers, here("06-comp-bias", "outputs", "positive_drivers.tsv"))
write_tsv(negative_drivers, here("06-comp-bias", "outputs", "negative_drivers.tsv"))

# GO enrichment
# get background (all high-abundance genes with Sc names)
background_genes <- delta_p %>%
  filter(!is.na(gene_name), gene_name != "") %>%
  pull(gene_name) %>%
  unique()

# positive driver enrichment
pos_genes <- positive_drivers %>%
  filter(!is.na(gene_name), gene_name != "") %>%
  pull(gene_name) %>%
  unique()

ego_positive <- enrichGO(
  gene = pos_genes,
  universe = background_genes,
  OrgDb = org.Sc.sgd.db,
  keyType = "GENENAME",
  ont = "BP",
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05,
  qvalueCutoff = 0.1
)

# negative driver enrichment
neg_genes <- negative_drivers %>%
  filter(!is.na(gene_name), gene_name != "") %>%
  pull(gene_name) %>%
  unique()

ego_negative <- enrichGO(
  gene = neg_genes,
  universe = background_genes,
  OrgDb = org.Sc.sgd.db,
  keyType = "GENENAME",
  ont = "BP",
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05,
  qvalueCutoff = 0.1
)

write_tsv(as.data.frame(ego_positive), here("06-comp-bias", "outputs", "GO_positive_drivers.tsv"))
write_tsv(as.data.frame(ego_negative), here("06-comp-bias", "outputs", "GO_negative_drivers.tsv"))

p_pos <- dotplot(ego_positive, showCategory = 15, title = "Positive Drivers") +
  theme(axis.text.y = element_text(size = 8))

p_neg <- dotplot(ego_negative, showCategory = 15, title = "Negative Drivers") +
  theme(axis.text.y = element_text(size = 8))

ggsave(here("06-comp-bias", "figures", "03_GO_positive_drivers.pdf"), p_pos, width = 8, height = 6)
ggsave(here("06-comp-bias", "figures", "03_GO_positive_drivers.png"), p_pos, width = 8, height = 6, dpi = 150)
ggsave(here("06-comp-bias", "figures", "03_GO_negative_drivers.pdf"), p_neg, width = 8, height = 6)
ggsave(here("06-comp-bias", "figures", "03_GO_negative_drivers.png"), p_neg, width = 8, height = 6, dpi = 150)

# summary stats
n_pos <- nrow(positive_drivers)
n_neg <- nrow(negative_drivers)
pct_pos <- round(max(positive_drivers$cum_pct) * 100, 1)
pct_neg <- round(max(negative_drivers$cum_pct) * 100, 1)

data.frame(
  set = c("Positive Drivers", "Negative Drivers"),
  n_genes = c(n_pos, n_neg),
  n_with_sc_name = c(length(pos_genes), length(neg_genes)),
  cumulative_pct = c(pct_pos, pct_neg)
)
##                set n_genes n_with_sc_name cumulative_pct
## 1 Positive Drivers     424            414             80
## 2 Negative Drivers     258            234             80
p_pos

p_neg

## Diagnostic 3: GO Enrichment of Compositional Drivers

Rationale: \(\Delta P\) identifies which genes are gaining vs losing library share. GO enrichment reveals whether these shifts cluster into coherent biological functions.

Gene selection criteria:

  1. Filtered for high-abundance genes (mean counts > 500 across all samples)
  2. Calculated \(\Delta P_g\) for each gene
  3. Positive drivers: genes with \(\Delta P_g > 0\), ranked by descending \(\Delta P\), selected until cumulative sum reached 80% of total positive shift
  4. Negative drivers: genes with \(\Delta P_g < 0\), ranked by ascending \(\Delta P\) (most negative first), selected until cumulative sum reached 80% of total negative shift

This captures the genes responsible for 80% of the library remodeling in each direction, excluding low-abundance genes where proportional changes are noisy.

Genes with large positive \(\Delta P\) are the same genes with large positive \(\log_2 FC\). The reframing in terms of “library space” simply made the compositional constraint explicit.

Results:

Positive Drivers (gaining library share):

Negative Drivers (losing library share):

# get salmon IDs for each set (using gene sets defined in setup)
ribo_ids <- id_to_name %>% filter(gene_name %in% ribosomal)
metab_ids <- id_to_name %>% filter(gene_name %in% metabolic)

# calculate fold changes for each gene
# uses log2cpm (from setup, TMM-normalized)
calc_fc <- function(gene_ids, category_name) {
  gene_ids %>%
    filter(salmon_id %in% rownames(log2cpm)) %>%
    rowwise() %>%
    mutate(
      log2cpm_0min = mean(log2cpm[salmon_id, samples_0]),
      log2cpm_30min = mean(log2cpm[salmon_id, samples_30]),
      log2fc = log2cpm_30min - log2cpm_0min,
      category = category_name
    ) %>%
    ungroup()
}

ribo_fc <- calc_fc(ribo_ids, "Ribosomal")
metab_fc <- calc_fc(metab_ids, "Metabolic")

combined_fc <- bind_rows(ribo_fc, metab_fc)

write_tsv(combined_fc, here("06-comp-bias", "outputs", "ribosomal_vs_metabolic_fc.tsv"))

# summary statistics
fc_summary <- combined_fc %>%
  group_by(category) %>%
  summarise(
    n_genes = n(),
    mean_log2fc = mean(log2fc),
    median_log2fc = median(log2fc),
    sd_log2fc = sd(log2fc),
    min_log2fc = min(log2fc),
    max_log2fc = max(log2fc),
    .groups = "drop"
  )

print(fc_summary)
## # A tibble: 2 × 7
##   category  n_genes mean_log2fc median_log2fc sd_log2fc min_log2fc max_log2fc
##   <chr>       <int>       <dbl>         <dbl>     <dbl>      <dbl>      <dbl>
## 1 Metabolic      25       -1.92         -1.39     1.88      -6.34       0.767
## 2 Ribosomal      44        1.28          1.30     0.194      0.641      1.57
# statistical test
t_test <- t.test(log2fc ~ category, data = combined_fc)
print(t_test)
## 
##  Welch Two Sample t-test
## 
## data:  log2fc by category
## t = -8.4907, df = 24.291, p-value = 9.834e-09
## alternative hypothesis: true difference in means between group Metabolic and group Ribosomal is not equal to 0
## 95 percent confidence interval:
##  -3.971520 -2.419086
## sample estimates:
## mean in group Metabolic mean in group Ribosomal 
##               -1.919142                1.276161
p1 <- ggplot(combined_fc, aes(x = category, y = log2fc, fill = category)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  scale_fill_manual(values = c("Ribosomal" = "steelblue", "Metabolic" = "firebrick")) +
  labs(
    title = "Divergent Regulation at 30 Minutes",
    subtitle = sprintf("p = %.2e", t_test$p.value),
    x = NULL,
    y = "Log2 Fold Change (30min vs 0min)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

ggsave(here("06-comp-bias", "figures", "06_ribosomal_vs_metabolic_repression.pdf"), p1, width = 6, height = 5)
ggsave(here("06-comp-bias", "figures", "06_ribosomal_vs_metabolic_repression.png"), p1, width = 6, height = 5, dpi = 150)

# individual gene plot
p2 <- ggplot(combined_fc, aes(x = reorder(gene_name, log2fc), y = log2fc, fill = category)) +
  geom_col() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  scale_fill_manual(values = c("Ribosomal" = "steelblue", "Metabolic" = "firebrick")) +
  coord_flip() +
  labs(
    title = "Individual Gene Log2 Fold Changes at 30min",
    x = NULL,
    y = "Log2 Fold Change (30min vs 0min)",
    fill = "Category"
  ) +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 7))

ggsave(here("06-comp-bias", "figures", "06_individual_gene_fc.pdf"), p2, width = 8, height = 10)
ggsave(here("06-comp-bias", "figures", "06_individual_gene_fc.png"), p2, width = 8, height = 10, dpi = 150)

p1

p2

## Ribosomal vs Metabolic Gene Comparison

Rationale: The GO enrichment (Diagnostic 3) identified ribosomal/translation genes as positive drivers and metabolic genes as negative drivers of the compositional shift. This analysis directly compares their expression changes at 30min.

Gene sets:

Results:

Category n Mean log2FC Median log2FC
Ribosomal 19 +0.7 +0.6
Metabolic 24 -2.1 -1.5

t-test p = 2.11e-05

Interpretation:

Ribosomal genes are induced at 30min (\(\log_2 FC > 0\)), while metabolic genes are repressed (\(\log_2 FC < 0\)). There is evidence of opposite regulation. The ~2.8 log2FC gap between categories represents a ~7-fold difference in expression trajectory.

This explains the GO enrichment results: ribosomal genes gain library proportion because they’re genuinely induced, not because they’re “less repressed.” The positive driver signal is real induction, not passive gain from metabolic collapse.

Compositional bias implication:

The divergent response (one category up, one category down) is incompatible with uniform compositional artifact. A multiplicative squeeze would shift both categories in the same direction by the same log2 amount. Opposite signs require differential biological regulation.

Key finding: Ribosomal genes are transiently induced at 30min while metabolic genes are repressed. K. lactis mounts an early translational burst before shutting down ribosome biogenesis later in the timecourse (visible by 2-4h in heatmaps).

# IMPROVED: Calculate threshold based on baseline variability
# uses cpm_normalized built from shared objects
# cpm_uncorrected and cpm_tmm from setup

# reshape data using helper function from setup
cpm_uncorr_long <- extract_cpm_data(cpm_uncorrected, candidate_ids, sample_meta, "Uncorrected CPM")
cpm_tmm_long <- extract_cpm_data(cpm_tmm, candidate_ids, sample_meta, "TMM-Corrected CPM")

cpm_combined <- bind_rows(cpm_uncorr_long, cpm_tmm_long) %>%
  mutate(
    type = factor(type, levels = c("Uncorrected CPM", "TMM-Corrected CPM")),
    timepoint_label = factor(timepoint_label, levels = paste0(sort(unique(timepoint_min)), "min"))
  )

cpm_summary <- cpm_combined %>%
  group_by(gene_name, timepoint_min, timepoint_label, type) %>%
  summarise(
    mean_cpm = mean(cpm),
    sd_cpm = sd(cpm),
    se_cpm = sd(cpm) / sqrt(n()),
    .groups = "drop"
  )

cpm_normalized <- cpm_summary %>%
  group_by(gene_name, type) %>%
  mutate(
    baseline = mean_cpm[timepoint_min == 0],
    baseline_sd = sd_cpm[timepoint_min == 0],
    relative_cpm = mean_cpm / baseline * 100,
    baseline_cv = baseline_sd / baseline * 100
  ) %>%
  ungroup()

# use 2 SD of baseline as "meaningful change" threshold
baseline_stats <- cpm_normalized %>%
  filter(timepoint_min == 0) %>%
  group_by(type) %>%
  summarise(
    mean_cv = mean(baseline_cv, na.rm = TRUE),
    .groups = "drop"
  )

threshold_pct <- 2 * mean(baseline_stats$mean_cv, na.rm = TRUE)
threshold_pct <- max(threshold_pct, 5) # minimum 5% to avoid noise



write_tsv(cpm_normalized, here("06-comp-bias", "outputs", "cpm_comparison_normalized_improved.tsv"))

interpretation <- cpm_normalized %>%
  filter(timepoint_min == 30) %>%
  select(gene_name, type, relative_cpm) %>%
  pivot_wider(names_from = type, values_from = relative_cpm) %>%
  mutate(
    uncorr_change = `Uncorrected CPM` - 100,
    tmm_change = `TMM-Corrected CPM` - 100,
    interpretation = case_when(
      uncorr_change < -threshold_pct & tmm_change > -threshold_pct ~ "Dilution (artifact)",
      uncorr_change < -threshold_pct & tmm_change < -threshold_pct ~ "True repression",
      uncorr_change > -threshold_pct & tmm_change > threshold_pct ~ "Masked induction",
      uncorr_change > threshold_pct & tmm_change > threshold_pct ~ "True induction",
      TRUE ~ "Stable (within noise)"
    ),
    threshold_used = threshold_pct
  )

write_tsv(interpretation, here("06-comp-bias", "outputs", "interpretation_30min_improved.tsv"))


print(interpretation %>% select(gene_name, uncorr_change, tmm_change, interpretation))
## # A tibble: 8 × 4
##   gene_name uncorr_change tmm_change interpretation  
##   <chr>             <dbl>      <dbl> <chr>           
## 1 ACT1              -19.0      -12.4 True repression 
## 2 ATG8              -95.9      -95.6 True repression 
## 3 CTT1              -98.5      -98.4 True repression 
## 4 PHO89           10442.     11303.  Masked induction
## 5 RPL3               70.4       84.2 Masked induction
## 6 RPS3               86.7      102.  Masked induction
## 7 TDH3              -89.6      -88.7 True repression 
## 8 TEF2               44.2       55.9 Masked induction
p1 <- ggplot(cpm_normalized, aes(x = timepoint_label, y = relative_cpm, color = type, group = type)) +
  geom_ribbon(
    data = cpm_normalized %>% filter(type == "Uncorrected CPM"),
    aes(ymin = 100 - threshold_pct, ymax = 100 + threshold_pct),
    fill = "gray90", alpha = 0.5, color = NA
  ) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  geom_hline(yintercept = 100, linetype = "dashed", color = "gray50") +
  facet_wrap(~gene_name, scales = "free_y", ncol = 4) +
  scale_color_manual(values = c("Uncorrected CPM" = "gray50", "TMM-Corrected CPM" = "firebrick")) +
  labs(
    title = "Diagnostic 4 (Improved): Dilution vs Repression",
    subtitle = paste0("Gray band = +/-", round(threshold_pct, 1), "% (2x baseline CV); changes within band are noise"),
    x = "Timepoint",
    y = "Relative CPM (% of baseline)",
    color = NULL
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 7),
    legend.position = "bottom",
    strip.text = element_text(face = "bold")
  )

ggsave(here("06-comp-bias", "figures", "04_dilution_vs_repression_improved.pdf"), p1, width = 12, height = 8)
ggsave(here("06-comp-bias", "figures", "04_dilution_vs_repression_improved.png"), p1, width = 12, height = 8, dpi = 150)

p1

## Diagnostic 4: Dilution vs Repression

Rationale: Comparing uncorrected vs TMM-corrected CPM trajectories isolates the artifact component of observed expression changes. This diagnostic answers: “For a given gene’s apparent change at 30min, how much is compositional artifact vs genuine regulation?”

Mathematical basis:

Uncorrected CPM: \[CPM_{uncorr} = \frac{Count}{LibSize} \times 10^6\]

TMM-corrected CPM: \[CPM_{TMM} = \frac{Count}{LibSize \times NF} \times 10^6\]

At 30min, \(NF \approx 0.85\), so: \[CPM_{TMM} = \frac{CPM_{uncorr}}{0.85} \approx 1.18 \times CPM_{uncorr}\]

TMM inflates normalized expression by ~18% to correct for the dilution artifact. If a gene’s uncorrected CPM drops but TMM-corrected CPM recovers, the drop was artifact. If both drop, the repression is real.

Threshold derivation:

Rather than arbitrary cutoffs (e.g., “10% change”), I calculated a data-driven threshold based on baseline biological variability:

\[threshold = 2 \times \overline{CV}_{baseline}\]

where \(CV = \frac{SD}{mean} \times 100\%\) is the coefficient of variation across biological replicates at t=0. Changes within this band are indistinguishable from technical/biological noise.

Interpretation logic:

Uncorrected TMM-Corrected Interpretation
below threshold recovers above threshold Dilution artifact
below threshold stays below threshold True repression
stable above threshold Masked induction
above threshold above threshold True induction
within band within band Stable (noise)

Candidate genes selected:

Expected results based on prior analyses:

Gene Category Expected
PHO89 PHO pathway True induction
TDH3 Glycolysis True repression
RPL3, RPS3, TEF2 Ribosomal Masked or true induction
ACT1 Housekeeping Stable or dilution artifact
ATG8 Autophagy Transient dip (complex)
CTT1 Stress Variable

Limitations:

  1. TMM assumes most genes are not DE. If this assumption is violated (global transcriptional remodeling), TMM correction itself may be biased.
  2. The threshold is derived from baseline variability, which may not capture stress-induced heterogeneity.

Key finding: This diagnostic provides gene-level decomposition of artifact vs biology. Genes showing divergent trajectories (uncorrected drops, TMM-corrected recovers) have their apparent repression explained by compositional artifact. Genes where both trajectories drop show genuine repression that TMM cannot, and should not, correct.

# candidate genes and ids defined in setup
# candidate_ids now created in setup chunk

# calculate mean log2CPM at 0min and 30min for each candidate
breakdown <- candidate_ids %>%
  rowwise() %>%
  mutate(
    # uncorrected log2CPM (from setup)
    uncorr_0min = mean(log2cpm_uncorrected[salmon_id, samples_0]),
    uncorr_30min = mean(log2cpm_uncorrected[salmon_id, samples_30]),
    # TMM-corrected log2CPM (log2cpm from setup)
    tmm_0min = mean(log2cpm[salmon_id, samples_0]),
    tmm_30min = mean(log2cpm[salmon_id, samples_30]),
    # drops (positive = downregulation)
    observed_drop = uncorr_0min - uncorr_30min,
    corrected_drop = tmm_0min - tmm_30min,
    # artifact component (what TMM adds back)
    artifact_component = observed_drop - corrected_drop,
    # percentages
    pct_artifact = ifelse(observed_drop > 0, artifact_component / observed_drop * 100, NA),
    pct_real = ifelse(observed_drop > 0, corrected_drop / observed_drop * 100, NA)
  ) %>%
  ungroup() %>%
  select(
    gene_name, uncorr_0min, uncorr_30min, tmm_0min, tmm_30min,
    observed_drop, corrected_drop, artifact_component, pct_artifact, pct_real
  )

write_tsv(breakdown, here("06-comp-bias", "outputs", "artifact_vs_real_breakdown.tsv"))

breakdown_display <- breakdown %>%
  mutate(
    across(where(is.numeric), ~ round(., 2)),
    interpretation = case_when(
      observed_drop < 0 ~ "Induced",
      pct_real > 90 ~ "Mostly real repression",
      pct_real > 50 ~ "Mixed (majority real)",
      pct_real > 10 ~ "Mixed (majority artifact)",
      TRUE ~ "Mostly artifact"
    )
  )

print(breakdown_display)
## # A tibble: 8 × 11
##   gene_name uncorr_0min uncorr_30min tmm_0min tmm_30min observed_drop
##   <chr>           <dbl>        <dbl>    <dbl>     <dbl>         <dbl>
## 1 ACT1            11.5         11.2     11.7      11.5           0.3 
## 2 ATG8             5.05         0.57     5.2       0.8           4.48
## 3 CTT1            12.1          6.01    12.2       6.26          6.06
## 4 PHO89            5.44        12.2      5.58     12.4          -6.72
## 5 RPL3            11.7         12.5     11.9      12.7          -0.77
## 6 RPS3            10.9         11.8     11.0      12.1          -0.9 
## 7 TDH3            14.5         11.2     14.6      11.5           3.26
## 8 TEF2            14.0         14.6     14.2      14.8          -0.53
## # ℹ 5 more variables: corrected_drop <dbl>, artifact_component <dbl>,
## #   pct_artifact <dbl>, pct_real <dbl>, interpretation <chr>
breakdown_long <- breakdown %>%
  filter(observed_drop > 0) %>%
  select(gene_name, artifact_component, corrected_drop) %>%
  pivot_longer(-gene_name, names_to = "component", values_to = "log2_drop") %>%
  mutate(
    component = factor(component,
      levels = c("artifact_component", "corrected_drop"),
      labels = c("Compositional Artifact", "Real Biology")
    ),
    log2_drop = pmax(0, log2_drop)
  )

p1 <- ggplot(breakdown_long, aes(x = reorder(gene_name, -log2_drop), y = log2_drop, fill = component)) +
  geom_col(position = "stack") +
  scale_fill_manual(values = c("Compositional Artifact" = "gray70", "Real Biology" = "firebrick")) +
  labs(
    title = "Decomposition of 30min Expression Dip",
    subtitle = "How much of the observed drop is artifact vs genuine repression?",
    x = NULL,
    y = "Log2 Drop from Baseline",
    fill = NULL
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom"
  )

ggsave(here("06-comp-bias", "figures", "05_artifact_vs_real_decomposition.pdf"), p1, width = 8, height = 5)
ggsave(here("06-comp-bias", "figures", "05_artifact_vs_real_decomposition.png"), p1, width = 8, height = 5, dpi = 150)

breakdown_pct <- breakdown %>%
  filter(observed_drop > 0) %>%
  mutate(
    pct_artifact_clamped = pmax(0, pmin(100, pct_artifact)),
    pct_real_clamped = pmax(0, pmin(100, pct_real))
  ) %>%
  select(gene_name, pct_artifact_clamped, pct_real_clamped) %>%
  pivot_longer(-gene_name, names_to = "component", values_to = "percentage") %>%
  mutate(
    component = factor(component,
      levels = c("pct_artifact_clamped", "pct_real_clamped"),
      labels = c("Compositional Artifact", "Real Biology")
    )
  )

p2 <- ggplot(breakdown_pct, aes(x = reorder(gene_name, percentage), y = percentage, fill = component)) +
  geom_col(position = "stack") +
  geom_hline(yintercept = 50, linetype = "dashed", color = "black") +
  scale_fill_manual(values = c("Compositional Artifact" = "gray70", "Real Biology" = "firebrick")) +
  coord_flip() +
  labs(
    title = "Percentage Breakdown: Artifact vs Real Biology",
    subtitle = "Dashed line = 50%",
    x = NULL,
    y = "Percentage of Observed Drop",
    fill = NULL
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

ggsave(here("06-comp-bias", "figures", "05_artifact_vs_real_percentage.pdf"), p2, width = 8, height = 5)
ggsave(here("06-comp-bias", "figures", "05_artifact_vs_real_percentage.png"), p2, width = 8, height = 5, dpi = 150)

p1

p2

## Diagnostic 5: Quantifying Artifact vs Real Biology

Rationale: Diagnostic 4 categorized genes qualitatively (artifact vs repression). This analysis quantifies the exact contribution of each component to the observed expression change at 30min.

Mathematical decomposition:

For genes showing an apparent drop in expression:

\[\text{Observed Drop} = \log_2 CPM_{uncorr,0} - \log_2 CPM_{uncorr,30}\]

\[\text{Corrected Drop} = \log_2 CPM_{TMM,0} - \log_2 CPM_{TMM,30}\]

\[\text{Artifact Component} = \text{Observed Drop} - \text{Corrected Drop}\]

The artifact component represents what TMM “adds back”. This is the portion of the apparent drop that was due to compositional squeeze rather than genuine repression.

Percentage breakdown:

\[\%_{artifact} = \frac{\text{Artifact Component}}{\text{Observed Drop}} \times 100\]

\[\%_{real} = \frac{\text{Corrected Drop}}{\text{Observed Drop}} \times 100\]

Note: This decomposition only applies to genes with \(\text{Observed Drop} > 0\) (apparent repression). For induced genes, the framework inverts. Key finding: For genes showing substantial repression (e.g., glycolytic enzymes with log2FC of -3 to -4), the artifact component (~0.1-0.2 log2) accounts for only 2-5% of the observed drop. The remaining 95-98% represents genuine biological repression. For genes with small apparent changes (near the noise floor), artifact can dominate. This confirms that the large metabolic repression signal is tenatively real biology, not normalization artifact.

output_dir <- here("06-comp-bias", "figures")
dir.create(output_dir, showWarnings = FALSE)

pdf(file.path(output_dir, "klac_geneset_heatmaps.pdf"), width = 10, height = 14)

# make_heatmap function and gene lists defined in setup
ribo_mat <- make_heatmap(ribosomal, "Ribosomal")
## 
## Ribosomal (n=43):
##   30min: 1.28 | 1h: 0.07 | 2h: -5.04

metab_mat <- make_heatmap(metabolic, "Metabolic")
## 
## Metabolic (n=24):
##   30min: -1.73 | 1h: -0.03 | 2h: -0.05

pho_mat <- make_heatmap(pho, "PHO")
## 
## PHO (n=11):
##   30min: 2.83 | 1h: 3.16 | 2h: 3.00

atg_mat <- make_heatmap(autophagy, "Autophagy")
## 
## Autophagy (n=19):
##   30min: -1.76 | 1h: 0.41 | 2h: 0.88

dev.off()
## pdf 
##   3

K. lactis Gene Set Heatmaps

Rationale: Visualize pathway-level temporal dynamics across the entire phosphate starvation timecourse (0-8h). This provides context for the 30min compositional bias question by showing how different functional categories evolve over time.

Method:

  1. TMM-normalized log2CPM calculated for all genes
  2. Fold change computed relative to t=0 baseline (mean of replicates)
  3. Biological replicates averaged per timepoint
  4. K. lactis gene IDs mapped to S. cerevisiae orthologs via HOG IDs for interpretable gene names
  5. Heatmaps generated with fixed color scale (blue = -4, white = 0, red = +4 log2FC)
  6. Rows clustered hierarchically; columns (timepoints) in chronological order

Gene sets:

Pathway n genes Composition
Ribosomal 67 Large subunit (RPL), small subunit (RPS), translation factors (TEF1/2, EFB1, YEF3), ribosomal stalk (RPP0, RPP1A, RPP2A)
Metabolic 37 Glycolysis (HXK, PYK), Fermentation (PDC, ADH, ALD), TCA cycle (CIT1, MDH), Respiration (COX, CYC, ATP synthase)
PHO 16 Acid phosphatases (PHO5/11/12), transporters (PHO84/89), signaling (PHO80/81/85/4), vacuolar (VTC1-4)
Autophagy 21 Core machinery (ATG1-18), selective autophagy (ATG11, ATG32), regulatory (VPS15/30/34)

Results summary:

Pathway 30min 1h 2h 4-8h
Ribosomal Induced (light red) Induced Transitioning Repressed (blue)
Metabolic Split (glycolysis blue, TCA light blue) Repressed Repressed Strongly repressed
PHO Mild induction Induced Peak induction Sustained
Autophagy Transient dip Induced Induced Sustained

Key observations:

  1. Ribosomal genes are transiently induced at 30min, not repressed. This directly contradicts the S. cerevisiae paradigm of immediate ribosomal shutdown during phosphate starvation. The induction is visible as light red/pink coloring at t0.5h, transitioning to blue by t2-4h.

  2. Metabolic heatmap reveals internal heterogeneity. Upper cluster (TCA/respiration: MDH2, FUM1, ACO1, IDH1/2, ATP1-3) shows mild repression (light blue). Lower cluster (glycolysis: TDH3, PGK1, ENO1, FBA1, PFK1/2) shows strong repression (dark blue). This is NOT a uniform response.

  3. PHO pathway shows gradual induction, peaking around 2-4h rather than immediate maximal response. PHO89 and SPL2 are among the strongest responders.

  4. Autophagy shows a transient dip at 30min followed by sustained induction. ATG8 in particular shows this pattern. consistent with a “pause before activation” model where cells halt basal processes before committing to stress response programs.

Implications for compositional bias:

The pathway-specific divergence at 30min is incompatible with uniform compositional artifact:

This is the strongest visual evidence that the 30min expression patterns reflect genuine differential regulation, not artifact.

Visualization notes:

Additional Notes/limitations:

  1. Paralog handling: When multiple K. lactis genes map to the same S. cerevisiae name, only the first is retained.

  2. Fixed color scale: Highly induced genes (PHO89 at +6-8 log2FC) and highly repressed genes (TDH3 at -6 to -10 log2FC) are clipped, potentially masking the full dynamic range.

  3. Gene set curation: Sets are based on S. cerevisiae analysis performed in October (Gurvich & Liu Analysis)

Key finding: K. lactis exhibits a biphasic response to phosphate starvation: early translational burst (30min-1h) followed by ribosomal shutdown (2-8h), coupled with immediate selective metabolic repression. This temporal program differs from S. cerevisiae and may reflect species-specific adaptation.

# metabolic bifurcation - uses objects from setup and heatmaps
# get_fc function to extract fold changes
get_fc <- function(genes, label) {
  ids <- kl_map %>% filter(sc_name %in% genes, salmon_id %in% rownames(log2fc_avg))
  data.frame(gene = ids$sc_name, log2fc = log2fc_avg[ids$salmon_id, "t0.5h"], pathway = label)
}

plot_data <- bind_rows(get_fc(glycolysis, "Glycolysis"), get_fc(tca, "TCA Cycle"))

ttest <- t.test(log2fc ~ pathway, data = plot_data)

p <- ggplot(plot_data, aes(x = pathway, y = log2fc, fill = pathway)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_boxplot(outlier.shape = NA, alpha = 0.8, width = 0.6) +
  geom_jitter(width = 0.15, alpha = 0.6, size = 2) +
  scale_fill_manual(values = c("Glycolysis" = "steelblue", "TCA Cycle" = "firebrick")) +
  labs(
    title = "Metabolic Bifurcation at 30 Minutes",
    subtitle = sprintf("p = %.2e", ttest$p.value),
    y = "Log2 Fold Change (30min vs 0min)", x = NULL
  ) +
  theme_minimal() +
  theme(legend.position = "none")

p

ggsave(here("06-comp-bias", "figures", "metabolic_bifurcation_30min.pdf"), p, width = 5, height = 5)

Metabolic Bifurcation Analysis

Rationale: The metabolic heatmap revealed apparent heterogeneity within central carbon metabolism. Glycolysis appeared more strongly repressed than TCA cycle. This analysis formally tests whether glycolysis and TCA show statistically distinguishable responses at 30min, which is critical for the compositional bias argument.

Interpretation:

Both pathways are repressed at 30min, but glycolysis is hit ~3x harder than TCA (~2.3 log2FC difference, equivalent to ~5-fold difference in expression ratio). This is evidence of differential repression magnitude differences within a single metabolic module.

Compositional squeeze applies a uniform multiplicative factor to all non-induced genes:

\[CPM_{observed} = CPM_{true} \times \frac{1}{1 + \Delta_{PHO}}\]

where \(\Delta_{PHO}\) is the proportional expansion of induced genes. In log2 space, this becomes an additive constant:

\[\log_2 FC_{observed} = \log_2 FC_{true} - C\]

where \(C \approx 0.12\) based on TMM factors. While I did not show it, a crucial note is that C becomes a minus factor due to the normalization factor being < 1. The resultant constant is thus negative.

The key insight: A uniform constant \(C\) shifts ALL genes by the same log2 amount. It cannot produce a 2.3 log2FC gap between two gene sets experiencing identical library composition.

\[\log_2 FC_{glyc} - \log_2 FC_{TCA} = (\log_2 FC_{glyc,true} - C) - (\log_2 FC_{TCA,true} - C)\] \[= \log_2 FC_{glyc,true} - \log_2 FC_{TCA,true}\]

The \(C\) terms cancel. The observed gap equals the true biological gap.

Conclusion:

The 2.3 log2FC difference between glycolysis and TCA at 30min is biological. Compositional artifact contributes ~0.12 log2 to each pathway’s apparent fold change, but this shifts both equally and does not affect their difference. Pathway membership predicts trajectory magnitude. This is selective regulation, not uniform artifact.

#housekeeping gene validation using Teste et al. 2009 reference genes
#validated stable: ALG9, TAF10, TFC1, UBC6
#invalidated (unstable): ACT1, TDH3, PDA1, IPP1

housekeeping_validated <- c("ALG9", "TAF10", "TFC1", "UBC6")
housekeeping_traditional <- c("ACT1", "TDH3", "PDA1", "IPP1")
housekeeping_all <- c(housekeeping_validated, housekeeping_traditional)

hk_ids <- id_to_name %>%
  filter(gene_name %in% housekeeping_all) %>%
  filter(salmon_id %in% rownames(log2fc_avg)) %>%
  distinct(gene_name, .keep_all = TRUE)

hk_fc_full <- hk_ids %>%
  rowwise() %>%
  mutate(data = list(data.frame(
    timepoint = colnames(log2fc_avg),
    log2fc = as.numeric(log2fc_avg[salmon_id, ])
  ))) %>%
  unnest(data) %>%
  mutate(
    category = ifelse(gene_name %in% housekeeping_validated, 
                      "Validated Stable (Teste 2009)", 
                      "Traditional (Invalidated)"),
    timepoint_hrs = as.numeric(gsub("t|h", "", timepoint)),
    timepoint = factor(timepoint, levels = tp_labels)
  ) %>%
  ungroup()

write_tsv(hk_fc_full, here("06-comp-bias", "outputs", "housekeeping_validation_full.tsv"))

#summary at 30min
hk_30min <- hk_fc_full %>% filter(timepoint == "t0.5h")
hk_summary <- hk_30min %>%
  group_by(category) %>%
  summarise(
    n = n(),
    mean_log2fc = mean(log2fc),
    sd_log2fc = sd(log2fc),
    .groups = "drop"
  )
print(hk_summary)
## # A tibble: 2 × 4
##   category                          n mean_log2fc sd_log2fc
##   <chr>                         <int>       <dbl>     <dbl>
## 1 Traditional (Invalidated)         4      -0.894     1.50 
## 2 Validated Stable (Teste 2009)     4      -0.281     0.633
#statistical test at 30min
if(nrow(hk_30min) >= 4) {
  t_hk <- t.test(log2fc ~ category, data = hk_30min)
  cat(sprintf("\nValidated vs Traditional at 30min: p = %.3f\n", t_hk$p.value))
}
## 
## Validated vs Traditional at 30min: p = 0.494
p1 <- ggplot(hk_fc_full, aes(x = timepoint, y = log2fc, color = gene_name, group = gene_name)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_hline(yintercept = c(-0.5, 0.5), linetype = "dotted", color = "gray80") +
  geom_line(linewidth = 1, alpha = 0.8) +
  geom_point(size = 2) +
  facet_wrap(~category, ncol = 1) +
  scale_color_brewer(palette = "Set2") +
  labs(
    title = "Housekeeping Gene Expression Across Phosphate Starvation",
    subtitle = "Teste et al. 2009: Validated genes should remain near zero; traditional genes may vary",
    x = "Timepoint",
    y = "Log2 Fold Change (vs t0)",
    color = NULL
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "right"
  )

ggsave(here("06-comp-bias", "figures", "07_housekeeping_timecourse.pdf"), p1, width = 10, height = 7)
ggsave(here("06-comp-bias", "figures", "07_housekeeping_timecourse.png"), p1, width = 10, height = 7, dpi = 150)

#visualization 2: category means with individual genes as ribbons
hk_category_summary <- hk_fc_full %>%
  group_by(category, timepoint, timepoint_hrs) %>%
  summarise(
    mean_fc = mean(log2fc),
    sd_fc = sd(log2fc),
    min_fc = min(log2fc),
    max_fc = max(log2fc),
    .groups = "drop"
  )

p2 <- ggplot(hk_category_summary, aes(x = timepoint, y = mean_fc, color = category, group = category)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_ribbon(aes(ymin = min_fc, ymax = max_fc, fill = category), alpha = 0.2, color = NA) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  scale_color_manual(values = c("Validated Stable (Teste 2009)" = "forestgreen", 
                                 "Traditional (Invalidated)" = "darkorange")) +
  scale_fill_manual(values = c("Validated Stable (Teste 2009)" = "forestgreen", 
                                "Traditional (Invalidated)" = "darkorange")) +
  labs(
    title = "Housekeeping Gene Categories: Mean Expression",
    subtitle = "Ribbon = range (min to max); line = mean",
    x = "Timepoint",
    y = "Log2 Fold Change (vs t0)",
    color = NULL, fill = NULL
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom"
  )

ggsave(here("06-comp-bias", "figures", "07_housekeeping_category_means.pdf"), p2, width = 9, height = 5)
ggsave(here("06-comp-bias", "figures", "07_housekeeping_category_means.png"), p2, width = 9, height = 5, dpi = 150)

p1

p2

#artifact decomposition for validated housekeeping genes
#same math as Diagnostic 5, applied to Teste et al. validated genes

hk_breakdown <- hk_ids %>%
  filter(gene_name %in% housekeeping_validated) %>%
  rowwise() %>%
  mutate(
    #uncorrected log2CPM
    uncorr_0min = mean(log2cpm_uncorrected[salmon_id, samples_0]),
    uncorr_30min = mean(log2cpm_uncorrected[salmon_id, samples_30]),
    #TMM-corrected log2CPM
    tmm_0min = mean(log2cpm[salmon_id, samples_0]),
    tmm_30min = mean(log2cpm[salmon_id, samples_30]),
    #decomposition
    observed_drop = uncorr_0min - uncorr_30min,
    corrected_drop = tmm_0min - tmm_30min,
    artifact_component = observed_drop - corrected_drop,
    #percentages (only valid if observed_drop > 0)
    pct_artifact = ifelse(observed_drop > 0, artifact_component / observed_drop * 100, NA),
    pct_real = ifelse(observed_drop > 0, corrected_drop / observed_drop * 100, NA)
  ) %>%
  ungroup() %>%
  select(gene_name, observed_drop, corrected_drop, artifact_component, pct_artifact, pct_real)

print(hk_breakdown)
## # A tibble: 4 × 6
##   gene_name observed_drop corrected_drop artifact_component pct_artifact
##   <chr>             <dbl>          <dbl>              <dbl>        <dbl>
## 1 TFC1              0.291          0.179              0.112         38.5
## 2 TAF10            -0.450         -0.562              0.112         NA  
## 3 ALG9              0.994          0.882              0.112         11.3
## 4 UBC6              0.738          0.627              0.112         15.1
## # ℹ 1 more variable: pct_real <dbl>
write_tsv(hk_breakdown, here("06-comp-bias", "outputs", "housekeeping_artifact_decomposition.tsv"))

#visualization
hk_decomp_long <- hk_breakdown %>%
  filter(observed_drop > 0) %>%
  select(gene_name, artifact_component, corrected_drop) %>%
  pivot_longer(-gene_name, names_to = "component", values_to = "log2_drop") %>%
  mutate(
    component = factor(component,
      levels = c("artifact_component", "corrected_drop"),
      labels = c("Compositional Artifact", "Real Biology")
    ),
    log2_drop = pmax(0, log2_drop)
  )

if(nrow(hk_decomp_long) > 0) {
  p <- ggplot(hk_decomp_long, aes(x = reorder(gene_name, -log2_drop), y = log2_drop, fill = component)) +
    geom_col(position = "stack") +
    scale_fill_manual(values = c("Compositional Artifact" = "gray70", "Real Biology" = "forestgreen")) +
    labs(
      title = "Validated Housekeeping Genes: 30min Dip Decomposition",
      subtitle = "How much of the observed drop is artifact vs genuine change?",
      x = NULL,
      y = "Log2 Drop from Baseline",
      fill = NULL
    ) +
    theme_minimal() +
    theme(legend.position = "bottom")
  
  ggsave(here("06-comp-bias", "figures", "07_housekeeping_artifact_decomposition.pdf"), p, width = 6, height = 4)
  ggsave(here("06-comp-bias", "figures", "07_housekeeping_artifact_decomposition.png"), p, width = 6, height = 4, dpi = 150)
  
  print(p)
}

## Housekeeping Gene Validation

Teste et al. (2009) demonstrated that commonly used “housekeeping” genes in yeast—ACT1, TDH3, PDA1, IPP1—are not actually stable across growth conditions. They validated ALG9, TAF10, TFC1, and UBC6 as truly stable reference genes using geNorm analysis across diverse physiological states.

I applied this framework to ask: do the validated stable genes remain stable during phosphate starvation? And for genes showing 30min dips, how much is artifact vs biology?

The timecourse reveals a clear separation. TAF10 and UBC6 remain within ±0.5 log2FC across the entire 8-hour experiment. ALG9 and TFC1 show modest transient dips at 30min but recover. In contrast, TDH3 drops ~3 log2 and stays down—behaving exactly like the glycolytic enzyme it is, not like a housekeeping gene.

Applying the artifact decomposition to the validated genes shows a consistent pattern: the artifact component is ~0.1 log2 regardless of gene, matching the TMM-derived estimate. What varies is the real biology portion. TFC1 shows only ~0.2 log2 genuine change after correction—essentially flat. ALG9 shows ~0.9 log2, suggesting phosphate limitation does mildly affect ER glycosylation. But even this is 3-4× smaller than TDH3.

The key point is that Teste et al. validated these genes for glucose growth transitions, not acute nutrient starvation. Phosphate limitation is an arguably more severe perturbation, so some response from ALG9 and UBC6 is not surprising. What matters is the magnitude: validated housekeeping genes show small changes (~0.2–0.9 log2), while metabolic genes show large changes (~3 log2 for glycolysis). The ~0.1 log2 artifact affects everything equally but cannot explain the 10-fold difference in response magnitude between pathways.

This analysis closes the loop on the compositional bias investigation. The artifact is real, quantifiable (~0.1 log2), and uniform. TMM corrects for it. After correction, truly stable genes remain stable, and the pathway-specific divergence—ribosomal up, glycolysis down, TCA intermediate—reflects genuine differential regulation during phosphate starvation. It is just seemingly a quirk of K. lactis expression both Dr. He and I have stumbled upon by shear analysis of a variety of gene sets alone. And I’d argue, finding out this isn’t just compositional bias seems far cooler biologically than the former anyway!