Why this topic?

Bioinformatics in oncology often means analyzing high‑dimensional gene expression data, with far more genes than patients.

This mini case study shows:

  • Dimension reduction via PCA
  • Differential expression testing
  • Multiple testing control with FDR (Benjamini–Hochberg)

All data are simulated to mimic tumor RNA‑seq patterns; no protected health information (PHI) is involved.

Dataset (simulated tumor transcriptomics)

We simulate an expression matrix:

  • \(n = 90\) patients
  • \(p = 500\) genes
  • 3 tumor subtypes: Basal, Luminal, HER2

We embed subtype‑specific signal in a subset of genes.

library(ggplot2)
library(dplyr)
library(tidyr)
library(plotly)

set.seed(42)

Math: PCA in one slide

Given centered data matrix \(X \in \mathbb{R}^{n \times p}\):

\[ \text{PC}_k = X v_k \]

where \(v_k\) is the unit‑norm eigenvector belonging to the \(k\)-th largest eigenvalue of the sample covariance matrix:

\[ S = \frac{1}{n-1}X^\top X,\quad S v_k = \lambda_k v_k \]

Interpretation: the \(v_k\) are orthogonal directions of successively maximal variance, and \(\lambda_k\) is the variance of \(\text{PC}_k\).
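
As a sanity check on the definition above, the sketch below computes PC scores from the eigendecomposition of \(S\) and compares them with `prcomp()`. The toy matrix and all variable names are illustrative assumptions, not the deck's data.

```r
# Toy check: PCA by eigendecomposition vs. prcomp (illustrative data only)
set.seed(1)
n <- 20; p <- 5
X_raw <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Center each column so that S = X'X / (n - 1) is the sample covariance
Xc <- scale(X_raw, center = TRUE, scale = FALSE)
S  <- crossprod(Xc) / (n - 1)

eig    <- eigen(S, symmetric = TRUE)   # eigenvalues in decreasing order
scores <- Xc %*% eig$vectors           # column k is PC_k = X v_k

pca <- prcomp(X_raw, center = TRUE, scale. = FALSE)

# Eigenvalues equal the PC variances; scores agree up to the sign of each v_k
eig_match   <- all.equal(eig$values, pca$sdev^2)
score_match <- all.equal(abs(unname(scores)), abs(unname(pca$x)))
```

The comparison uses absolute values because an eigenvector is only defined up to sign, so two implementations may flip individual PCs.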

Simulate data + run PCA
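
A minimal sketch of how this step could look. The dimensions and subtype names come from the dataset slide; the choice of 20 signal genes per subtype, the shift of 2 units, and the object names (`X`, `subtype`, `pca`, `pc_df`) are assumptions made for illustration.

```r
# Sketch of the simulation; effect size and signal-gene layout are assumptions
set.seed(42)
n <- 90; p <- 500
subtype <- factor(rep(c("Basal", "Luminal", "HER2"), each = n / 3))

# Baseline: independent Gaussian noise, read as log2-scale expression
X <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(NULL, paste0("gene", seq_len(p))))

# Embed subtype-specific signal: 20 genes shifted upward per subtype (assumed)
shift <- 2
X[subtype == "Basal",   1:20]  <- X[subtype == "Basal",   1:20]  + shift
X[subtype == "Luminal", 21:40] <- X[subtype == "Luminal", 21:40] + shift
X[subtype == "HER2",    41:60] <- X[subtype == "HER2",    41:60] + shift

# PCA on centered (unscaled) data, matching the definition on the math slide
pca <- prcomp(X, center = TRUE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# First three PC scores plus labels, ready for plotting
pc_df <- data.frame(pca$x[, 1:3], subtype = subtype)
```

Leaving the data unscaled keeps the PCA consistent with the covariance-based definition; passing `scale. = TRUE` would instead work on the correlation matrix.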

ggplot #1: PCA (2D)
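
One way such a plot might be drawn is sketched below. It is not the original figure: the data are re-simulated so the snippet runs on its own, and the signal layout, plot styling, and object names are assumptions.

```r
library(ggplot2)

# Compact re-simulation so the snippet runs on its own (layout is an assumption)
set.seed(42)
n <- 90; p <- 500
subtype <- factor(rep(c("Basal", "Luminal", "HER2"), each = n / 3))
X <- matrix(rnorm(n * p), n, p)
for (k in 1:3) {
  rows <- subtype == levels(subtype)[k]
  cols <- (20 * (k - 1) + 1):(20 * k)   # 20 signal genes per subtype
  X[rows, cols] <- X[rows, cols] + 2
}

pca   <- prcomp(X, center = TRUE, scale. = FALSE)
pct   <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)
pc_df <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2], subtype = subtype)

# Scatter of the first two PCs, coloured by subtype, with variance in the axis labels
p_pca <- ggplot(pc_df, aes(x = PC1, y = PC2, colour = subtype)) +
  geom_point(size = 2, alpha = 0.8) +
  labs(x = paste0("PC1 (", pct[1], "%)"),
       y = paste0("PC2 (", pct[2], "%)"),
       colour = "Subtype") +
  theme_minimal()
```

Calling `print(p_pca)` renders the figure.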

plotly (3D): PCA exploration

Differential expression: Basal vs Luminal

We test each gene with a two‑sample t‑test (Welch's unequal‑variance version, which is the default of R's t.test).

  • Effect size: \(\log_2\) fold change (computed here as the difference in group means, which assumes expression is already on a \(\log_2\) scale)
  • Significance: p‑value per gene
  • Correct for multiplicity using BH-FDR

Math: Multiple testing & FDR (BH)

With \(m\) tests, ordered p-values \(p_{(1)} \le \dots \le p_{(m)}\), and a target FDR level \(\alpha\), BH finds:

\[ k = \max\left\{i : p_{(i)} \le \frac{i}{m}\alpha \right\} \]

Then reject the \(k\) hypotheses with p-values \(p_{(1)},\dots,p_{(k)}\); if no \(i\) satisfies the inequality, reject none.

Goal: control the false discovery rate:

\[ \mathrm{FDR} = \mathbb{E}\left[\frac{V}{\max(R,1)}\right] \]

where \(V\) is the number of false positives (true nulls rejected) and \(R\) is the total number of rejections.
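
The step-up rule can be written out by hand and compared with `p.adjust()`. The ten p-values below are made up purely for illustration, and the variable names are assumptions.

```r
# BH step-up procedure written out, checked against p.adjust()
# (the p-values are invented for illustration, deliberately unsorted)
pvals <- c(0.041, 0.001, 0.212, 0.014, 0.060, 0.012, 0.360, 0.042, 0.074, 0.205)
alpha <- 0.05
m     <- length(pvals)

ord      <- order(pvals)
p_sorted <- pvals[ord]
passes   <- p_sorted <= seq_len(m) / m * alpha   # p_(i) <= (i / m) * alpha

# k = largest i that passes (0 if none); reject the k smallest p-values
k <- if (any(passes)) max(which(passes)) else 0
reject_manual <- rep(FALSE, m)
reject_manual[ord[seq_len(k)]] <- TRUE

# Equivalent decision rule: BH-adjusted p-value <= alpha
reject_padj <- p.adjust(pvals, method = "BH") <= alpha
```

Here \(k = 3\): the second-smallest p-value (0.012) exceeds its own threshold \(2\alpha/m = 0.010\) but is still rejected, because the third-smallest (0.014) falls below \(3\alpha/m = 0.015\). That is the reason the rule takes a maximum over \(i\) rather than stopping at the first failure.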

ggplot #2: Volcano plot
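
A hedged sketch of how the volcano plot could be built. It is not the original figure: the data are re-simulated so the snippet is self-contained, and the signal layout, the 5% FDR cut-off, the colours, and the object names are all assumptions.

```r
library(ggplot2)

# Compact re-simulation so the snippet runs on its own (layout is an assumption;
# HER2-specific signal is omitted because it does not enter this contrast)
set.seed(42)
n <- 90; p <- 500
subtype <- rep(c("Basal", "Luminal", "HER2"), each = n / 3)
X <- matrix(rnorm(n * p), n, p)
X[subtype == "Basal",   1:20]  <- X[subtype == "Basal",   1:20]  + 2
X[subtype == "Luminal", 21:40] <- X[subtype == "Luminal", 21:40] + 2

# Per-gene effect size and Welch t-test p-value, Basal vs Luminal
idx_b <- subtype == "Basal"
idx_l <- subtype == "Luminal"
de <- data.frame(
  log2fc = colMeans(X[idx_b, ]) - colMeans(X[idx_l, ]),
  pval   = apply(X, 2, function(g) t.test(g[idx_b], g[idx_l])$p.value)
)
de$padj <- p.adjust(de$pval, method = "BH")
de$sig  <- de$padj < 0.05   # 5% FDR threshold is an illustrative choice

p_volcano <- ggplot(de, aes(x = log2fc, y = -log10(pval), colour = sig)) +
  geom_point(alpha = 0.7) +
  scale_colour_manual(values = c(`FALSE` = "grey60", `TRUE` = "firebrick"),
                      name = "BH-adjusted p < 0.05") +
  labs(x = "Mean difference (Basal - Luminal, log2 scale)",
       y = expression(-log[10](p))) +
  theme_minimal()
```

The y-axis uses raw p-values while the colour reflects the BH-adjusted ones, a common convention that keeps the vertical scale comparable across analyses.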

ggplot #3: Heatmap of top genes

One slide with R code

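# Differential expression (Basal vs Luminal)
# Assumes X (patients x genes, log2 scale) and subtype (one label per row of X)
idx_b <- subtype == "Basal"
idx_l <- subtype == "Luminal"

# Difference in group means = log2 fold change on log2-scale data
log2fc <- colMeans(X[idx_b, ]) - colMeans(X[idx_l, ])
# Welch two-sample t-test per gene (t.test default: var.equal = FALSE)
pval   <- apply(X, 2, function(g) t.test(g[idx_b], g[idx_l])$p.value)
padj   <- p.adjust(pval, method = "BH")  # BH-adjusted p-values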

Takeaways

  • PCA can reveal subtype structure in high‑dimensional expression data.
  • Hundreds of genes here (thousands in real studies) → many tests → use BH-FDR to keep the expected share of false discoveries below \(\alpha\).
  • Visuals included:
    • PCA (2D ggplot + 3D plotly)
    • Volcano plot
    • Heatmap of top genes