Cancer Transcriptomics with PCA & Multiple Testing

Why this topic?

Bioinformatics in oncology often means analyzing high‑dimensional gene expression data, with far more genes than patients.

This mini case study shows:

Dimension reduction via PCA
Differential expression testing
Multiple testing control with FDR (Benjamini–Hochberg)

All data are simulated to mimic tumor RNA‑seq patterns (no PHI).

Dataset (simulated tumor transcriptomics)

We simulate an expression matrix:

\(n = 90\) patients
\(p = 500\) genes
3 tumor subtypes: Basal, Luminal, HER2

We embed subtype‑specific signal in a subset of genes.

library(ggplot2)
library(dplyr)
library(tidyr)
library(plotly)

set.seed(42)

Math: PCA in one slide

Given centered data matrix \(X \in \mathbb{R}^{n \times p}\):

\[ \text{PC}_k = X v_k \]

where \(v_k\) is the \(k\)-th eigenvector of the sample covariance matrix, with eigenvalues ordered \(\lambda_1 \ge \lambda_2 \ge \dots\):

\[ S = \frac{1}{n-1}X^\top X,\quad S v_k = \lambda_k v_k \]

Interpretation: the loading vectors \(v_k\) are orthogonal directions of successively maximal variance, and \(\lambda_k\) is the variance of \(\text{PC}_k\).
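The definitions above can be checked numerically. The sketch below uses a small random matrix (20 x 5, an arbitrary choice for illustration) and compares the eigendecomposition of \(S\) with base R's `prcomp()`; eigenvectors are only defined up to sign, so scores are compared in absolute value.

```r
set.seed(1)

# Small illustrative matrix, column-centered as the slide assumes
X <- scale(matrix(rnorm(20 * 5), nrow = 20), center = TRUE, scale = FALSE)

# S = X^T X / (n - 1)
S <- crossprod(X) / (nrow(X) - 1)

# eigen() returns eigenvalues in decreasing order for symmetric input
eig <- eigen(S, symmetric = TRUE)
V   <- eig$vectors

# PC_k = X v_k, for all k at once
PC <- X %*% V

# Reference implementation (data already centered)
pca <- prcomp(X, center = FALSE)

# Eigenvalues are the variances of the PCs
all.equal(eig$values, pca$sdev^2)
all.equal(eig$values, unname(apply(PC, 2, var)))

# Scores agree up to the sign of each component
all.equal(abs(PC), abs(pca$x), check.attributes = FALSE)

# Loading vectors are orthonormal: V^T V = I
all.equal(crossprod(V), diag(ncol(X)))
```

Each `all.equal()` call should return `TRUE`, which ties the formula on this slide to what `prcomp()` computes.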

Simulate data + run PCA
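The code behind this slide is not shown. A minimal sketch follows, assuming 30 patients per subtype, standard-normal noise as a stand-in for log2 expression, and a shift of +2 in 20 marker genes per subtype. These settings, the `markers` list, and the patient/gene labels are illustrative assumptions, not the deck's actual parameters.

```r
set.seed(42)

n <- 90   # patients
p <- 500  # genes
subtype <- factor(rep(c("Basal", "Luminal", "HER2"), each = n / 3))

# Baseline: standard-normal noise standing in for log2 expression (assumption)
X <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(paste0("P", seq_len(n)), paste0("G", seq_len(p))))

# Hypothetical subtype signal: 20 marker genes per subtype, shifted by +2
markers <- list(Basal = 1:20, Luminal = 21:40, HER2 = 41:60)
for (s in names(markers)) {
  rows <- subtype == s
  X[rows, markers[[s]]] <- X[rows, markers[[s]]] + 2
}

# PCA on the column-centered matrix (no per-gene scaling)
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Proportion of variance explained by each PC
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained[1:3], 3)

# Scores on the first three PCs, one row per patient
scores <- data.frame(pca$x[, 1:3], subtype = subtype)
head(scores)
```

Because the signal lives in only 60 of 500 genes, the leading PCs are the ones expected to carry the subtype separation; the remaining components are mostly noise.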

ggplot #1: PCA (2D)
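The plotting code for this slide is hidden. One way to draw it is sketched below; the simulation settings (30 patients per subtype, +2 shift in 20 marker genes each) are assumptions made so the example runs on its own, and the styling choices are arbitrary.

```r
library(ggplot2)

# Illustrative simulation (assumed settings, not the deck's actual ones)
set.seed(42)
n <- 90; p <- 500
subtype <- factor(rep(c("Basal", "Luminal", "HER2"), each = n / 3))
X <- matrix(rnorm(n * p), n, p,
            dimnames = list(paste0("P", seq_len(n)), paste0("G", seq_len(p))))
X[subtype == "Basal",   1:20]  <- X[subtype == "Basal",   1:20]  + 2
X[subtype == "Luminal", 21:40] <- X[subtype == "Luminal", 21:40] + 2
X[subtype == "HER2",    41:60] <- X[subtype == "HER2",    41:60] + 2

# PCA and percent variance explained, used for the axis labels
pca <- prcomp(X, center = TRUE)
pve <- 100 * pca$sdev^2 / sum(pca$sdev^2)

scores <- data.frame(pca$x[, 1:2], subtype = subtype)

# Scatter plot of the first two PC scores, colored by subtype
p_pca <- ggplot(scores, aes(x = PC1, y = PC2, color = subtype)) +
  geom_point(size = 2, alpha = 0.8) +
  labs(x = sprintf("PC1 (%.1f%% variance)", pve[1]),
       y = sprintf("PC2 (%.1f%% variance)", pve[2]),
       color = "Subtype",
       title = "PCA of simulated tumor expression") +
  theme_minimal()

p_pca
```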

plotly (3D): PCA exploration
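The interactive figure is not reproduced in this text. A possible version using `plot_ly()` with a `scatter3d` trace is sketched below; the simulation settings are the same illustrative assumptions as elsewhere (30 patients per subtype, +2 shift in 20 marker genes each), and the marker size and title are arbitrary.

```r
library(plotly)

# Illustrative simulation (assumed settings, not the deck's actual ones)
set.seed(42)
n <- 90; p <- 500
subtype <- factor(rep(c("Basal", "Luminal", "HER2"), each = n / 3))
X <- matrix(rnorm(n * p), n, p,
            dimnames = list(paste0("P", seq_len(n)), paste0("G", seq_len(p))))
X[subtype == "Basal",   1:20]  <- X[subtype == "Basal",   1:20]  + 2
X[subtype == "Luminal", 21:40] <- X[subtype == "Luminal", 21:40] + 2
X[subtype == "HER2",    41:60] <- X[subtype == "HER2",    41:60] + 2

# Scores on the first three principal components
pca <- prcomp(X, center = TRUE)
scores <- data.frame(pca$x[, 1:3], subtype = subtype)

# 3D scatter of PC1-PC3, one color per subtype
fig <- plot_ly(scores, x = ~PC1, y = ~PC2, z = ~PC3, color = ~subtype,
               type = "scatter3d", mode = "markers",
               marker = list(size = 4))
fig <- plotly::layout(fig, title = "PCA scores (3D)")

fig
```

Rotating the 3D view can expose separation along PC3 that a 2D projection of PC1 and PC2 hides.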

Differential expression: Basal vs Luminal

We test each gene using a two‑sample t‑test (Welch's unequal‑variance version, the default of R's t.test()).

Effect size: \(\log_2\) fold change, computed as the difference in group means (this assumes expression is already on the \(\log_2\) scale)
Significance: p‑value per gene
Correct for multiplicity using BH-FDR

Math: Multiple testing & FDR (BH)

With \(m\) tests and ordered p-values \(p_{(1)} \le \dots \le p_{(m)}\), BH finds:

\[ k = \max\left\{i : p_{(i)} \le \frac{i}{m}\alpha \right\} \]

Then reject \(H_0\) for the \(k\) hypotheses with p-values \(p_{(1)},\dots,p_{(k)}\); if no \(i\) satisfies the condition, reject none.

Goal: control the false discovery rate at level \(\alpha\) (guaranteed when the tests are independent):

\[ \mathrm{FDR} = \mathbb{E}\left[\frac{V}{\max(R,1)}\right] \]

where \(V\) = false positives and \(R\) = total rejections.
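The step-up rule can be written out by hand and compared with `p.adjust()`. In the sketch below, `bh_reject()` is a hypothetical helper written for this illustration and the ten p-values are made up; rejecting when the BH-adjusted p-value is at most \(\alpha\) should give the same set.

```r
# Hypothetical helper: BH step-up procedure at level alpha.
# Returns a logical vector marking the rejected hypotheses.
bh_reject <- function(p, alpha = 0.05) {
  m   <- length(p)
  ord <- order(p)                                   # indices of sorted p-values
  ok  <- which(p[ord] <= seq_len(m) / m * alpha)    # i with p_(i) <= (i/m) * alpha
  k   <- if (length(ok) > 0) max(ok) else 0         # largest such i, or 0 if none
  reject <- rep(FALSE, m)
  if (k > 0) reject[ord[seq_len(k)]] <- TRUE        # reject the k smallest
  reject
}

# Made-up p-values for illustration
pvals <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216)

rejected <- bh_reject(pvals, alpha = 0.05)
sum(rejected)  # 2: only 0.001 and 0.008 fall below their thresholds 0.005 and 0.010

# Same decision via adjusted p-values
identical(rejected, p.adjust(pvals, method = "BH") <= 0.05)
```

Note that BH takes the largest passing index, so a p-value that misses its own threshold is still rejected when a larger one passes: with `c(0.01, 0.03, 0.04, 0.045)` at \(\alpha = 0.05\), all four are rejected because \(p_{(4)} = 0.045 \le 0.05\).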

ggplot #2: Volcano plot
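The volcano plot itself is not reproduced in this text. A sketch of how it might be built is given below, assuming the same illustrative simulation as elsewhere (30 patients per subtype, +2 shift in 20 marker genes each) and an FDR cutoff of 0.05, which the slides do not specify.

```r
library(ggplot2)

# Illustrative simulation (assumed settings, not the deck's actual ones)
set.seed(42)
n <- 90; p <- 500
subtype <- factor(rep(c("Basal", "Luminal", "HER2"), each = n / 3))
X <- matrix(rnorm(n * p), n, p,
            dimnames = list(paste0("P", seq_len(n)), paste0("G", seq_len(p))))
X[subtype == "Basal",   1:20]  <- X[subtype == "Basal",   1:20]  + 2
X[subtype == "Luminal", 21:40] <- X[subtype == "Luminal", 21:40] + 2
X[subtype == "HER2",    41:60] <- X[subtype == "HER2",    41:60] + 2

# Per-gene Basal vs Luminal comparison
idx_b  <- subtype == "Basal"
idx_l  <- subtype == "Luminal"
log2fc <- colMeans(X[idx_b, ]) - colMeans(X[idx_l, ])
pval   <- apply(X, 2, function(g) t.test(g[idx_b], g[idx_l])$p.value)
padj   <- p.adjust(pval, method = "BH")

de <- data.frame(gene = colnames(X), log2fc = log2fc, pval = pval,
                 padj = padj, significant = padj < 0.05)  # 0.05 is an assumed cutoff

# Effect size on x, -log10 raw p-value on y, colored by the BH decision
p_volcano <- ggplot(de, aes(x = log2fc, y = -log10(pval), color = significant)) +
  geom_point(alpha = 0.7) +
  scale_color_manual(values = c("FALSE" = "grey60", "TRUE" = "firebrick"),
                     name = "BH-adjusted p < 0.05") +
  labs(x = "log2 fold change (Basal - Luminal)", y = "-log10(p-value)",
       title = "Volcano plot: Basal vs Luminal") +
  theme_minimal()

p_volcano
```

In this simulated setup, genes raised in Basal appear on the right, genes raised in Luminal on the left, and HER2-only markers sit near zero because they do not differ between the two groups being compared.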

ggplot #3: Heatmap of top genes
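The heatmap is not reproduced in this text. The sketch below shows one way to draw it with `geom_tile()`, under several assumptions that the slides do not state: "top genes" means the 30 genes with the smallest BH-adjusted p-values in the Basal vs Luminal test, each gene is z-scored across patients, and the data come from the same illustrative simulation used elsewhere.

```r
library(ggplot2)

# Illustrative simulation (assumed settings, not the deck's actual ones)
set.seed(42)
n <- 90; p <- 500
subtype <- factor(rep(c("Basal", "Luminal", "HER2"), each = n / 3))
X <- matrix(rnorm(n * p), n, p,
            dimnames = list(paste0("P", seq_len(n)), paste0("G", seq_len(p))))
X[subtype == "Basal",   1:20]  <- X[subtype == "Basal",   1:20]  + 2
X[subtype == "Luminal", 21:40] <- X[subtype == "Luminal", 21:40] + 2
X[subtype == "HER2",    41:60] <- X[subtype == "HER2",    41:60] + 2

# Rank genes by BH-adjusted p-value (Basal vs Luminal) and keep the top 30
idx_b <- subtype == "Basal"
idx_l <- subtype == "Luminal"
pval  <- apply(X, 2, function(g) t.test(g[idx_b], g[idx_l])$p.value)
padj  <- p.adjust(pval, method = "BH")
top   <- order(padj)[1:30]

# z-score each selected gene across patients
Z <- scale(X[, top])

# Long format: as.vector() reads the matrix column by column (gene by gene)
long <- data.frame(
  patient = factor(rep(rownames(Z), times = ncol(Z)), levels = rownames(Z)),
  gene    = factor(rep(colnames(Z), each = nrow(Z)), levels = colnames(Z)),
  subtype = rep(subtype, times = ncol(Z)),
  z       = as.vector(Z)
)

# One tile per patient-gene pair, patients grouped into subtype panels
p_heat <- ggplot(long, aes(x = patient, y = gene, fill = z)) +
  geom_tile() +
  scale_fill_gradient2(low = "navy", mid = "white", high = "firebrick",
                       midpoint = 0, name = "z-score") +
  facet_grid(. ~ subtype, scales = "free_x", space = "free_x") +
  labs(x = "Patients", y = "Gene", title = "Top 30 genes: Basal vs Luminal") +
  theme_minimal() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

p_heat
```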

One slide with R code

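# Differential expression (Basal vs Luminal)
idx_b <- subtype == "Basal"
idx_l <- subtype == "Luminal"

# Effect size: per-gene difference in group means (log2 fold change on a log2 scale)
log2fc <- colMeans(X[idx_b, ]) - colMeans(X[idx_l, ])
# Per-gene Welch two-sample t-test (t.test() default: var.equal = FALSE)
pval   <- apply(X, 2, function(g) t.test(g[idx_b], g[idx_l])$p.value)
# Benjamini-Hochberg adjusted p-values (BH-FDR)
padj   <- p.adjust(pval, method = "BH")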

Takeaways

PCA reveals subtype structure in high‑dimensional expression data.
Many genes → many tests (500 here, thousands in real data) → use BH-FDR to limit the expected share of false discoveries.
Visuals included:
- PCA (2D ggplot + 3D plotly)
- Volcano plot
- Heatmap of top genes