NCI60 FDR Analysis for Melanoma, CNS, and Leukemia

Key Question: Are genes statistically independent?

No. Expression levels of genes measured in the same cell line are not statistically independent. Gene expression is correlated due to:

Co-regulation (genes in same pathway)
Chromosomal proximity
Shared transcription factors

Because of this dependence, we call our FDR function with ind = FALSE, which selects the more conservative variant of the Benjamini-Hochberg (BH) procedure, the one that remains valid when the p-values are dependent.
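To make the effect of the ind argument concrete, here is a minimal sketch of a step-up rule with and without a dependence correction. It is an illustration only: the internals of fdr() in fdr.pck are not shown in this report, so the helper step_up(), its harmonic-sum correction (the Benjamini-Yekutieli adjustment), and the p-values are all assumptions made for the example.

```r
# Hypothetical p-values for illustration only (not taken from NCI60)
p <- c(0.001, 0.004, 0.019, 0.030, 0.045, 0.060, 0.090, 0.150, 0.400, 0.700)
Q <- 0.2

# Step-up rule: flag the k smallest p-values, where k is the largest rank i
# with p_(i) <= Q * i / (m * c_m).
# Assumption: c_m = 1 when ind = TRUE, and c_m = 1 + 1/2 + ... + 1/m when
# ind = FALSE, which shrinks every threshold and makes the rule stricter.
step_up <- function(p, Q, ind = TRUE) {
  m   <- length(p)
  c_m <- if (ind) 1 else sum(1 / seq_len(m))
  ord <- order(p)
  ok  <- which(p[ord] <= Q * seq_len(m) / (m * c_m))
  if (length(ok) == 0) return(integer(0))
  sort(ord[seq_len(max(ok))])   # original positions of the flagged tests
}

flag_ind <- step_up(p, Q, ind = TRUE)
flag_dep <- step_up(p, Q, ind = FALSE)
cat("Flagged assuming independence:", length(flag_ind), "\n")
cat("Flagged allowing dependence:  ", length(flag_dep), "\n")
```

The dependence-adjusted rule can never flag more tests than the unadjusted one, which is the sense in which ind = FALSE is the conservative choice. Base R's p.adjust() with method = "BH" or "BY" offers a way to cross-check a rule of this form.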

library(ISLR)
data(NCI60)
source("fdr.pck")

table(NCI60$labs)

## 
##      BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA 
##           7           5           7           1           1           6 
## MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE 
##           1           1           8           9           6           2 
##       RENAL     UNKNOWN 
##           9           1

# Function to find interesting genes for one cancer type
get_interesting <- function(cancer, Q = 0.2, ind = FALSE) {
  
  # Which samples belong to this cancer type
  grp <- NCI60$labs == cancer
  
  # Run a two-sample t-test for each gene (cancer vs all others)
  pvals <- apply(NCI60$data, 2, function(g) {
    t.test(g[grp], g[!grp])$p.value
  })
  
  # Apply FDR correction using our fdr() function
  # ind = FALSE because genes are NOT independent (correlated expression)
  out <- fdr(pvals, Q = Q, ind = ind)
  
  # Handle both logical and index output from fdr()
  if (is.logical(out$interesting)) {
    interesting <- which(out$interesting)  # convert TRUE/FALSE to indices
  } else {
    interesting <- out$interesting         # already indices or names
  }
  
  # Compute mean expression inside vs outside the cancer group
  mean_in  <- colMeans(NCI60$data[grp,  , drop = FALSE])
  mean_out <- colMeans(NCI60$data[!grp, , drop = FALSE])
  
  # Build results table
  tab <- data.frame(
    gene      = interesting,
    p_value   = pvals[interesting],
    mean_diff = mean_in[interesting] - mean_out[interesting]
  )
  
  # Sort by smallest p-value first
  tab <- tab[order(tab$p_value), ]
  rownames(tab) <- NULL
  
  return(tab)
}

# ── Run analysis for each cancer type ──────────────────────────────────────────
mel <- get_interesting("MELANOMA", Q = 0.2, ind = FALSE)

cns <- get_interesting("CNS",      Q = 0.2, ind = FALSE)

leu <- get_interesting("LEUKEMIA", Q = 0.2, ind = FALSE)

# ── How many interesting genes were found? ─────────────────────────────────────
cat("Melanoma interesting genes: ", nrow(mel), "\n")

## Melanoma interesting genes:  379

cat("CNS interesting genes:      ", nrow(cns), "\n")

## CNS interesting genes:       81

cat("Leukemia interesting genes: ", nrow(leu), "\n")

## Leukemia interesting genes:  347

# ── Top genes for each cancer ──────────────────────────────────────────────────
cat("\n--- Top Melanoma Genes ---\n")

## 
## --- Top Melanoma Genes ---

print(head(mel, 15))

##    gene      p_value  mean_diff
## 1  4347 8.413460e-16  2.0999121
## 2   247 2.369564e-10 -2.0151775
## 3   196 1.335088e-09 -3.3335695
## 4  5467 2.604742e-09  1.7998241
## 5  5586 3.164027e-09 -2.5489275
## 6  4384 5.467477e-09  2.0351789
## 7  4255 5.831809e-09  1.4582157
## 8  6389 7.157991e-09 -1.2303498
## 9  4348 1.261400e-08  2.2564289
## 10 4256 1.506040e-08  1.3016078
## 11 6399 1.657162e-08 -0.7661597
## 12 4355 2.138624e-08  2.2433043
## 13 2024 3.116491e-08 -1.0046410
## 14 2233 3.212347e-08  0.5880368
## 15 5320 3.212880e-08 -0.8382132

cat("\n--- Top CNS Genes ---\n")

## 
## --- Top CNS Genes ---

print(head(cns, 15))

##    gene      p_value  mean_diff
## 1  5481 2.090845e-14  3.0291066
## 2  6686 3.317234e-11  1.7095010
## 3  5867 8.407955e-10  2.4878229
## 4  5482 9.839034e-10  1.2699925
## 5  5689 1.332537e-09  1.4608247
## 6  6689 2.568636e-09  2.2138227
## 7  4527 4.304442e-09 -0.6175505
## 8  1487 7.139663e-09 -0.9207189
## 9  5839 7.431886e-09  2.8423654
## 10 3174 1.001831e-08 -1.1023804
## 11 1968 5.215960e-08 -0.8247542
## 12  439 8.594930e-08 -0.6760414
## 13 5074 1.012065e-07 -0.4940926
## 14 6688 1.079841e-07  2.3098230
## 15 1715 1.804007e-07 -0.6045498

cat("\n--- Top Leukemia Genes ---\n")

## 
## --- Top Leukemia Genes ---

print(head(leu, 15))

##    gene      p_value  mean_diff
## 1  3933 7.470566e-13 -1.2753114
## 2  2320 7.557846e-13  1.6215851
## 3  2170 2.339842e-12  0.8099471
## 4  5872 5.713374e-12 -2.8830413
## 5  4244 7.284406e-11 -3.4433864
## 6  5868 2.186107e-10 -3.6918922
## 7  1343 5.734032e-10  0.6610965
## 8  1693 1.754387e-09  0.9770161
## 9  5949 3.754526e-09 -0.9428115
## 10 2215 7.124688e-09  0.3462690
## 11 1283 9.793830e-09  0.4944874
## 12  304 1.105311e-08 -1.0253684
## 13 2347 1.727125e-08  0.7853494
## 14 5878 1.829951e-08 -3.3778114
## 15 1692 1.949873e-08  0.8774184

# ── Find overlapping genes between cancer pairs ────────────────────────────────
common_mel_cns <- intersect(mel$gene, cns$gene)
common_mel_leu <- intersect(mel$gene, leu$gene)
common_cns_leu <- intersect(cns$gene, leu$gene)

cat("\n--- Common genes: Melanoma & CNS ---\n")

## 
## --- Common genes: Melanoma & CNS ---

print(common_mel_cns)

##  [1] 4348  257 5839 5868 5869 1013 5694 5804 4574 6388 5746

cat("Count:", length(common_mel_cns), "\n")

## Count: 11

cat("\n--- Common genes: Melanoma & Leukemia ---\n")

## 
## --- Common genes: Melanoma & Leukemia ---

print(common_mel_leu)

##  [1] 6399 6416 4288  251 5868 5869 4289  252  245 6391 5887  315 6369 4448 5872
## [16] 2025 4287 6414 6393 6392 1919 5804 5555 5356 4317 4380 5886 5556 4067 3874
## [31] 6081 1347  143 5557 2039

cat("Count:", length(common_mel_leu), "\n")

## Count: 35

cat("\n--- Common genes: CNS & Leukemia ---\n")

## 
## --- Common genes: CNS & Leukemia ---

print(common_cns_leu)

## [1] 5867 5869 5868 6065 5930 5804 5880 6018 5917

cat("Count:", length(common_cns_leu), "\n")

## Count: 9

Graph Interpretations:

The plots show the ordered p-values for each gene when comparing a specific cancer type (Melanoma, CNS, and Leukemia) to all other cancers. The black curve represents all genes, while the red points highlight the genes that are considered statistically significant after applying the FDR correction with Q=0.2. The horizontal line represents the significance threshold.

In all three graphs, most p-values increase smoothly, which indicates that the majority of genes are not significantly different across cancer types. However, a small cluster of red points appears at the left side of each graph, showing that only a subset of genes have very low p-values and are considered “interesting.”

Comparing the three cancers, Melanoma and Leukemia have far more significant genes (379 and 347) than CNS (81). This suggests that Melanoma and Leukemia have stronger or more widespread gene expression differences relative to other cancers, while CNS has fewer genes that clearly stand out.

Overall, the graphs show that gene expression differences are concentrated in a relatively small share of genes (roughly 1% to 6% of the 6,830 genes measured, depending on the cancer type), and most genes are not strongly associated with a specific cancer type.
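The kind of plot described above can be sketched on simulated p-values. This is an illustration under stated assumptions: the p-values are simulated rather than taken from NCI60, the selection rule is a dependence-adjusted step-up rule, and placing the horizontal line at the largest flagged p-value is a guess at what the plots show. The actual fdr() function may draw them differently.

```r
# Simulated p-values for illustration only (not the NCI60 results):
# 950 "null" genes with uniform p-values plus 50 genes with very small ones.
set.seed(1)
p <- c(runif(950), rbeta(50, 1, 2000))
Q <- 0.2
m <- length(p)

# Assumed selection rule: step-up thresholds with a harmonic-sum correction
c_m    <- sum(1 / seq_len(m))
p_sort <- sort(p)
passed <- which(p_sort <= Q * seq_len(m) / (m * c_m))
k      <- if (length(passed) > 0) max(passed) else 0   # number of flagged genes
cutoff <- if (k > 0) p_sort[k] else NA                 # largest flagged p-value

# Black curve: all ordered p-values; red points: flagged genes
plot(seq_len(m), p_sort, type = "l",
     xlab = "Rank", ylab = "Ordered p-value")
if (k > 0) {
  points(seq_len(k), p_sort[seq_len(k)], col = "red", pch = 16)
  abline(h = cutoff, lty = 2)   # horizontal line at the cutoff p-value
}
```

On a plot like this the flagged genes sit in the lower-left corner, and the long, nearly straight rise of the remaining curve is what uniformly distributed (null) p-values look like once sorted.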

Are there any common interesting genes between any 2 cancers?

Yes.

Cancer Pair	Common Genes	Count
Melanoma ∩ CNS	4348, 257, 5839, 5868, 5869, 1013, 5694, 5804, 4574, 6388, 5746	11 genes
Melanoma ∩ Leukemia	6399, 6416, 4288, 251, 5868, 5869, 4289, 252, 245, 6391…	35 genes
CNS ∩ Leukemia	5867, 5869, 5868, 6065, 5930, 5804, 5880, 6018, 5917	9 genes

Observations

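Melanoma and Leukemia share the most interesting genes (35), which may point to shared biological mechanisms, although these two cancers also have the longest gene lists, so a larger overlap is expected
Genes 5804, 5868, and 5869 appear in all three pairwise intersections, so they are flagged as interesting for all three cancer types
These three genes separate each of the three cancers from the remaining samples, though not necessarily in the same direction, which makes them particularly noteworthy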
The direction of expression (mean_diff) matters:
- Positive mean_diff → gene is over-expressed in that cancer
- Negative mean_diff → gene is under-expressed in that cancer
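The pairwise overlaps listed above can be intersected once more to see which genes are flagged for all three cancers. The sketch below copies the gene indices from the printed output, so it runs without the NCI60 data or fdr.pck.

```r
# Pairwise overlaps, copied from the printed output above
common_mel_cns <- c(4348, 257, 5839, 5868, 5869, 1013, 5694, 5804, 4574, 6388, 5746)
common_cns_leu <- c(5867, 5869, 5868, 6065, 5930, 5804, 5880, 6018, 5917)

# A gene in both pairwise overlaps is interesting for Melanoma, CNS, and Leukemia
common_all <- intersect(common_mel_cns, common_cns_leu)
print(sort(common_all))
```

With the full result tables in memory, Reduce(intersect, list(mel$gene, cns$gene, leu$gene)) is an equivalent way to get the same set.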

Interesting Genes Found Per Cancer

Cancer	Sample Size	Interesting Genes Found
Melanoma	n = 8	379 genes
CNS	n = 5	81 genes
Leukemia	n = 6	347 genes

CNS yielded the fewest interesting genes, which is consistent with its having the smallest sample size (n = 5): with fewer samples in the group, the t-test has less power to detect a real difference
Melanoma and Leukemia yielded substantially more interesting genes (379 and 347, respectively)
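The link between group size and the number of detections can be checked with a small simulation. This is a sketch under assumptions, not part of the original analysis: the effect size, the per-test cutoff, and the helper detect_rate() are all hypothetical, and the data are simulated normal values rather than NCI60 expression levels.

```r
set.seed(42)
n_total <- 64     # total number of cell lines in the label table above
shift   <- 1.5    # hypothetical true mean difference, in standard deviations
alpha   <- 0.01   # hypothetical per-test p-value cutoff
n_sim   <- 500    # simulated genes per group size

# Fraction of simulated genes whose Welch t-test p-value falls below alpha
# when the cancer group has n_group samples and the rest form the comparison
detect_rate <- function(n_group) {
  hits <- replicate(n_sim, {
    inside  <- rnorm(n_group, mean = shift)
    outside <- rnorm(n_total - n_group)
    t.test(inside, outside)$p.value < alpha
  })
  mean(hits)
}

rates <- sapply(c(CNS = 5, LEUKEMIA = 6, MELANOMA = 8), detect_rate)
print(rates)
```

For the same true difference, the detection rate rises with group size, so part of the gap between CNS and the other two cancers may reflect statistical power rather than biology.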

Top Genes Per Cancer

Melanoma:

Top gene: 4347 (p = 8.4e-16, mean_diff = +2.10) → over-expressed
Gene 247 (p = 2.4e-10, mean_diff = -2.02) → under-expressed
Gene 196 (p = 1.3e-09, mean_diff = -3.33) → strongly under-expressed

CNS:

Top gene: 5481 (p = 2.1e-14, mean_diff = +3.03) → strongly over-expressed
Gene 6686 (p = 3.3e-11, mean_diff = +1.71) → over-expressed
Gene 4527 (p = 4.3e-09, mean_diff = -0.62) → under-expressed

Leukemia:

Top gene: 3933 (p = 7.5e-13, mean_diff = -1.28) → under-expressed
Gene 2320 (p = 7.6e-13, mean_diff = +1.62) → over-expressed
Gene 5868 (p = 2.2e-10, mean_diff = -3.69) → strongly under-expressed

Assignment 7 - NCI60 FDR Analysis for Melanoma, CNS, and Leukemia