No - genes from the same cell are not statistically independent. Gene expression is correlated due to:
Co-regulation (genes in same pathway)
Chromosomal proximity
Shared transcription factors
This means we use ind = FALSE in our FDR function, which
uses the more conservative BH procedure that accounts
for dependence.
library(ISLR)
data(NCI60)
source("fdr.pck")
table(NCI60$labs)
##
## BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA
## 7 5 7 1 1 6
## MCF7A-repro MCF7D-repro MELANOMA NSCLC OVARIAN PROSTATE
## 1 1 8 9 6 2
## RENAL UNKNOWN
## 9 1
# Function to find interesting genes for one cancer type
get_interesting <- function(cancer, Q = 0.2, ind = FALSE) {
# Which samples belong to this cancer type
grp <- NCI60$labs == cancer
# Run a two-sample t-test for each gene (cancer vs all others)
pvals <- apply(NCI60$data, 2, function(g) {
t.test(g[grp], g[!grp])$p.value
})
# Apply FDR correction using our fdr() function
# ind = FALSE because genes are NOT independent (correlated expression)
out <- fdr(pvals, Q = Q, ind = ind)
# Handle both logical and index output from fdr()
if (is.logical(out$interesting)) {
interesting <- which(out$interesting) # convert TRUE/FALSE to indices
} else {
interesting <- out$interesting # already indices or names
}
# Compute mean expression inside vs outside the cancer group
mean_in <- colMeans(NCI60$data[grp, , drop = FALSE])
mean_out <- colMeans(NCI60$data[!grp, , drop = FALSE])
# Build results table
tab <- data.frame(
gene = interesting,
p_value = pvals[interesting],
mean_diff = mean_in[interesting] - mean_out[interesting]
)
# Sort by smallest p-value first
tab <- tab[order(tab$p_value), ]
rownames(tab) <- NULL
return(tab)
}
# ── Run analysis for each cancer type ──────────────────────────────────────────
mel <- get_interesting("MELANOMA", Q = 0.2, ind = FALSE)
cns <- get_interesting("CNS", Q = 0.2, ind = FALSE)
leu <- get_interesting("LEUKEMIA", Q = 0.2, ind = FALSE)
# ── How many interesting genes were found? ─────────────────────────────────────
cat("Melanoma interesting genes: ", nrow(mel), "\n")
## Melanoma interesting genes: 379
cat("CNS interesting genes: ", nrow(cns), "\n")
## CNS interesting genes: 81
cat("Leukemia interesting genes: ", nrow(leu), "\n")
## Leukemia interesting genes: 347
# ── Top genes for each cancer ──────────────────────────────────────────────────
cat("\n--- Top Melanoma Genes ---\n")
##
## --- Top Melanoma Genes ---
print(head(mel, 15))
## gene p_value mean_diff
## 1 4347 8.413460e-16 2.0999121
## 2 247 2.369564e-10 -2.0151775
## 3 196 1.335088e-09 -3.3335695
## 4 5467 2.604742e-09 1.7998241
## 5 5586 3.164027e-09 -2.5489275
## 6 4384 5.467477e-09 2.0351789
## 7 4255 5.831809e-09 1.4582157
## 8 6389 7.157991e-09 -1.2303498
## 9 4348 1.261400e-08 2.2564289
## 10 4256 1.506040e-08 1.3016078
## 11 6399 1.657162e-08 -0.7661597
## 12 4355 2.138624e-08 2.2433043
## 13 2024 3.116491e-08 -1.0046410
## 14 2233 3.212347e-08 0.5880368
## 15 5320 3.212880e-08 -0.8382132
cat("\n--- Top CNS Genes ---\n")
##
## --- Top CNS Genes ---
print(head(cns, 15))
## gene p_value mean_diff
## 1 5481 2.090845e-14 3.0291066
## 2 6686 3.317234e-11 1.7095010
## 3 5867 8.407955e-10 2.4878229
## 4 5482 9.839034e-10 1.2699925
## 5 5689 1.332537e-09 1.4608247
## 6 6689 2.568636e-09 2.2138227
## 7 4527 4.304442e-09 -0.6175505
## 8 1487 7.139663e-09 -0.9207189
## 9 5839 7.431886e-09 2.8423654
## 10 3174 1.001831e-08 -1.1023804
## 11 1968 5.215960e-08 -0.8247542
## 12 439 8.594930e-08 -0.6760414
## 13 5074 1.012065e-07 -0.4940926
## 14 6688 1.079841e-07 2.3098230
## 15 1715 1.804007e-07 -0.6045498
cat("\n--- Top Leukemia Genes ---\n")
##
## --- Top Leukemia Genes ---
print(head(leu, 15))
## gene p_value mean_diff
## 1 3933 7.470566e-13 -1.2753114
## 2 2320 7.557846e-13 1.6215851
## 3 2170 2.339842e-12 0.8099471
## 4 5872 5.713374e-12 -2.8830413
## 5 4244 7.284406e-11 -3.4433864
## 6 5868 2.186107e-10 -3.6918922
## 7 1343 5.734032e-10 0.6610965
## 8 1693 1.754387e-09 0.9770161
## 9 5949 3.754526e-09 -0.9428115
## 10 2215 7.124688e-09 0.3462690
## 11 1283 9.793830e-09 0.4944874
## 12 304 1.105311e-08 -1.0253684
## 13 2347 1.727125e-08 0.7853494
## 14 5878 1.829951e-08 -3.3778114
## 15 1692 1.949873e-08 0.8774184
# ── Find overlapping genes between cancer pairs ────────────────────────────────
common_mel_cns <- intersect(mel$gene, cns$gene)
common_mel_leu <- intersect(mel$gene, leu$gene)
common_cns_leu <- intersect(cns$gene, leu$gene)
cat("\n--- Common genes: Melanoma & CNS ---\n")
##
## --- Common genes: Melanoma & CNS ---
print(common_mel_cns)
## [1] 4348 257 5839 5868 5869 1013 5694 5804 4574 6388 5746
cat("Count:", length(common_mel_cns), "\n")
## Count: 11
cat("\n--- Common genes: Melanoma & Leukemia ---\n")
##
## --- Common genes: Melanoma & Leukemia ---
print(common_mel_leu)
## [1] 6399 6416 4288 251 5868 5869 4289 252 245 6391 5887 315 6369 4448 5872
## [16] 2025 4287 6414 6393 6392 1919 5804 5555 5356 4317 4380 5886 5556 4067 3874
## [31] 6081 1347 143 5557 2039
cat("Count:", length(common_mel_leu), "\n")
## Count: 35
cat("\n--- Common genes: CNS & Leukemia ---\n")
##
## --- Common genes: CNS & Leukemia ---
print(common_cns_leu)
## [1] 5867 5869 5868 6065 5930 5804 5880 6018 5917
cat("Count:", length(common_cns_leu), "\n")
## Count: 9
The plots show the ordered p-values for each gene when comparing a specific cancer type (Melanoma, CNS, and Leukemia) to all other cancers. The black curve represents all genes, while the red points highlight the genes that are considered statistically significant after applying the FDR correction with Q=0.2. The horizontal line represents the significance threshold.
In all three graphs, most p-values increase smoothly, which indicates that the majority of genes are not significantly different across cancer types. However, a small cluster of red points appears at the left side of each graph, showing that only a subset of genes have very low p-values and are considered “interesting.”
Comparing the three cancers, Melanoma and Leukemia have many more significant genes (379 and 347) compared to CNS (81). This suggests that Melanoma and Leukemia have stronger or more widespread gene expression differences relative to other cancers, while CNS has fewer genes that clearly stand out.
Overall, the graphs show that gene expression differences are concentrated in a small number of genes, and most genes are not strongly associated with a specific cancer type.
Yes.
| Cancer Pair | Common Genes | Count |
|---|---|---|
| Melanoma ∩ CNS | 4348, 257, 5839, 5868, 5869, 1013, 5694, 5804, 4574, 6388, 5746 | 11 genes |
| Melanoma ∩ Leukemia | 6399, 6416, 4288, 251, 5868, 5869, 4289, 252, 245, 6391… | 35 genes |
| CNS ∩ Leukemia | 5867, 5869, 5868, 6065, 5930, 5804, 5880, 6018, 5917 | 9 genes |
Melanoma & Leukemia share the most common genes (35) suggesting possible shared biological mechanisms
Genes 5868 and 5869 appear in ALL THREE cancer types — these may represent broadly dysregulated genes across multiple cancers
Gene 5804 appears in all three pairwise comparisons, making it a particularly noteworthy gene
The direction of expression (mean_diff) matters:
Positive mean_diff → gene is over-expressed in that cancer
Negative mean_diff → gene is under-expressed in that cancer
| Cancer | Sample Size | Interesting Genes Found |
|---|---|---|
| Melanoma | n = 8 | 379 genes |
| CNS | n = 5 | 81 genes |
| Leukemia | n = 6 | 347 genes |
CNS found the fewest interesting genes, which makes sense given it has the smallest sample size (n=5), making it harder to detect significant differences
Melanoma and Leukemia found substantially more interesting genes (~350+)
Melanoma:
Top gene: 4347 (p = 8.4e-16, mean_diff = +2.10) → over-expressed
Gene 247 (p = 2.4e-10, mean_diff = -2.02) → under-expressed
Gene 196 (p = 1.3e-09, mean_diff = -3.33) → strongly under-expressed
CNS:
Top gene: 5481 (p = 2.1e-14, mean_diff = +3.03) → strongly over-expressed
Gene 6686 (p = 3.3e-11, mean_diff = +1.71) → over-expressed
Gene 4527 (p = 4.3e-09, mean_diff = -0.62) → under-expressed
Leukemia:
Top gene: 3933 (p = 7.5e-13, mean_diff = -1.28) → under-expressed
Gene 2320 (p = 7.6e-13, mean_diff = +1.62) → over-expressed
Gene 5868 (p = 2.2e-10, mean_diff = -3.69) → strongly under-expressed