Question 1: Conceptual Methodology Behind Principal Component Analysis

PCA (Principal Component Analysis) is a method for reducing a large set of variables into a smaller set of summary variables while preserving most of the essential information. In many experiments, thousands of genes are measured in a relatively small number of samples. PCA identifies patterns in those measurements and creates new variables, called principal components, that capture the major differences among samples. The first principal component is the combination of genes that varies the most across all samples. The second principal component captures the next largest source of variation while remaining uncorrelated with the first. This process continues until every component is defined, although typically only the first few components are needed.

By plotting each sample’s scores on the first two or three principal components, researchers can create a two- or three-dimensional graph in which samples with similar gene expression profiles cluster together. These visualizations make it easier to detect groups of samples that share characteristics, such as treatment group, disease state, or batch effect. Samples that fall far from the main clusters often represent outliers or cases with unique patterns. In this way, PCA provides an intuitive way to explore high-dimensional gene expression data without examining every single gene.
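To make the idea concrete, here is a minimal sketch on simulated data (the matrix below is hypothetical and unrelated to GSE63477): prcomp() computes the components for samples-in-rows input, and the proportion of variance explained by each component can be read from its summary.

# Minimal PCA sketch on simulated data (hypothetical matrix, not GSE63477)
set.seed(1)
toy <- matrix(rnorm(50 * 6), nrow = 50, ncol = 6)   # 50 "genes" x 6 "samples"
toy_pca <- prcomp(t(toy), scale. = TRUE)            # transpose so samples are rows
round(summary(toy_pca)$importance[, 1:3], 3)        # per-PC and cumulative variance explained
plot(toy_pca$x[, 1], toy_pca$x[, 2],
     xlab = "PC1", ylab = "PC2", main = "Toy PCA: samples in PC space")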

Load Required Libraries

library(ggplot2)
library(readr)
library(pheatmap)
library(scatterplot3d)

Load Data

# Expression data
expr <- read_csv("GSE63477_expression.csv")
## New names:
## Rows: 44629 Columns: 13
## ── Column specification ────────────────────────────────────────────────
## Delimiter: ","
## dbl (13): ...1, GSM1550559, GSM1550560, GSM1550561, GSM1550562, GSM1550563, ...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
# Sample info
info <- read_csv("GSE63477_sample_info.csv")
## New names:
## Rows: 12 Columns: 36
## ── Column specification ────────────────────────────────────────────────
## Delimiter: ","
## chr (32): ...1, title, geo_accession, status, submission_date, last_update_d...
## dbl (4): channel_count, taxid_ch1, contact_zip/postal_code, data_row_count
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
# Prepare expression matrix
expr <- as.data.frame(expr)
rownames(expr) <- expr[[1]]
expr <- expr[, -1]
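
Before treatment labels from info are attached to PCA scores later on, it is worth confirming that the expression columns and the metadata rows refer to the same samples in the same order. A small sketch, assuming the geo_accession column of info holds the GSM identifiers that appear as expression column names:

# Sanity check (sketch): expression columns should correspond to metadata rows
stopifnot(ncol(expr) == nrow(info))
# Assuming `geo_accession` contains the GSM IDs, reorder `info` to match the
# column order of `expr` so later annotations line up sample by sample
if (all(colnames(expr) %in% info$geo_accession)) {
  info <- info[match(colnames(expr), info$geo_accession), ]
}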

Data Preprocessing

Select Top Variable Genes

# Compute variance for each gene (row)
vars <- apply(expr, 1, var)

# Store the gene IDs of the top 100 and top 1000 most variable genes
top100_genes  <- names(sort(vars, decreasing = TRUE)[1:100])
top1000_genes <- names(sort(vars, decreasing = TRUE)[1:1000])

# Subset the expression matrix by those gene IDs
top100  <- expr[top100_genes, ]
top1000 <- expr[top1000_genes, ]
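
As an aside, if the matrixStats package is available, the per-gene variances can be computed without apply(); this is only an optional speed-up sketch and is not required for the analysis above.

# Optional sketch: faster row variances via matrixStats (if installed)
vars_fast <- matrixStats::rowVars(as.matrix(expr))
names(vars_fast) <- rownames(expr)
# head(sort(vars_fast, decreasing = TRUE))  # should match head(sort(vars, decreasing = TRUE))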

Question 2: Principal Component Analysis (PCA)

run_pca <- function(expr_subset, info, label) {
  # Run PCA: transpose so samples are rows; scale features (genes)
  pca <- prcomp(t(expr_subset), scale. = TRUE)
  
  # Create a data.frame of PCA scores (samples × PCs) and add treatment annotation
  pca_df <- as.data.frame(pca$x)
  pca_df$treatment <- info$`treatment:ch1`
  
  # Compute cumulative variance explained by each PC
  var_exp <- cumsum(pca$sdev^2 / sum(pca$sdev^2))
  pc_50  <- which(var_exp >= 0.5)[1]
  pc_75  <- which(var_exp >= 0.75)[1]
  
  # Scree Plot: cumulative % variance vs. PC index
  plot(
    var_exp * 100, type = "b",
    main = paste("Scree Plot:", label),
    xlab = "Principal Component", ylab = "Cumulative Variance (%)"
  )
  
  # 2D PCA Scatterplot (PC1 vs. PC2)
  print(
    ggplot(pca_df, aes(PC1, PC2, color = treatment)) +
      geom_point(size = 3) +
      labs(title = paste("PCA (Raw):", label),
           x = "PC1", y = "PC2") +
      theme_minimal() +
      theme(legend.position = "bottom")
  )
  
  # 3D PCA Scatterplot (PC1, PC2, PC3)
  scatterplot3d(
    pca_df$PC1, pca_df$PC2, pca_df$PC3,
    color = as.numeric(factor(pca_df$treatment)),
    pch   = 19,
    xlab  = "PC1", ylab = "PC2", zlab = "PC3",
    main  = paste("3D PCA (Raw):", label)
  )
  
  # Print how many PCs are needed for 50% and 75% variance
  cat(label, "- PCs for ≥50% variance:", pc_50, "\n")
  cat(label, "- PCs for ≥75% variance:", pc_75, "\n\n")
  
  invisible(list(var_exp = var_exp, pc_50 = pc_50, pc_75 = pc_75))
}
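
The helper reports only the cumulative curve and the two thresholds. To inspect per-component proportions directly, the importance table from summary() can be examined outside the helper (a quick sketch on the top 100 subset defined above):

# Sketch: per-PC standard deviation, proportion of variance, and cumulative proportion
pca_peek <- prcomp(t(top100), scale. = TRUE)
round(summary(pca_peek)$importance[, 1:5], 3)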

PCA on Top 100 and Top 1000 Most Variable Genes

result_top100_raw <- run_pca(top100, info, "Top 100 Genes")

## Top 100 Genes - PCs for ≥50% variance: 2 
## Top 100 Genes - PCs for ≥75% variance: 3
result_top1000_raw <- run_pca(top1000, info, "Top 1000 Genes")

## Top 1000 Genes - PCs for ≥50% variance: 2 
## Top 1000 Genes - PCs for ≥75% variance: 4

Interpretation (Top 100, Raw):
In the scree plot, the first principal component (PC1) captures approximately 41.3 % of the total variance, and by PC3 the cumulative total reaches around 79.2 %. According to the output, two PCs are needed to reach 50 % of the variance and three to reach 75 %. In the 2D scatterplot (PC1 vs. PC2), samples cluster moderately by treatment: Treatment A samples separate from Treatment B samples along PC2. However, one sample sits as an outlier and pulls PC1 toward what may be a technical effect such as batch variation. The 3D plot confirms that the Treatment A vs. B separation is clearest along PC2 and PC3, suggesting that PC1 is influenced by unwanted technical noise. Since these are raw values, technical noise still influences PC1; we will see whether normalization improves the separation.

Interpretation (Top 1000, Raw):
The first principal component now explains about 31 % of the variance, and the cumulative total exceeds 75 % only by PC4. According to the output, two PCs are needed to reach 50 % and four to reach 75 %. In the 2D plot, the treatment groups separate less cleanly than in the top 100 analysis; one Treatment B sample still appears as an outlier along PC1, likely due to technical variation. The 3D plot shows overlap between treatments along PC3, indicating that including lower-variance genes introduces noise that dilutes the biological signal.

Z-score Normalization Across Genes

expr_norm <- t(scale(t(expr)))
vars_norm <- apply(expr_norm, 1, var)
top100_norm <- expr_norm[names(sort(vars_norm, decreasing = TRUE)[1:100]), ]
top1000_norm <- expr_norm[names(sort(vars_norm, decreasing = TRUE)[1:1000]), ]
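
A quick sanity check (sketch): after z-scoring, each gene (row) should have a mean of approximately 0 and a standard deviation of approximately 1 across samples; genes with zero variance would become NaN and could be dropped before PCA.

# Sketch: verify the row-wise z-scoring
summary(rowMeans(expr_norm, na.rm = TRUE))      # should be ~0 for every gene
summary(apply(expr_norm, 1, sd, na.rm = TRUE))  # should be ~1 (NA/NaN if a gene had zero variance)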

PCA on Normalized Data

result_top100_norm <- run_pca(top100_norm, info, "Top 100 Genes (Normalized)")

## Top 100 Genes (Normalized) - PCs for ≥50% variance: 4 
## Top 100 Genes (Normalized) - PCs for ≥75% variance: 6
result_top1000_norm <- run_pca(top1000_norm, info, "Top 1000 Genes (Normalized)")

## Top 1000 Genes (Normalized) - PCs for ≥50% variance: 4 
## Top 1000 Genes (Normalized) - PCs for ≥75% variance: 7

Interpretation (Top 100, Normalized):
Now four PCs are needed to reach 50 % of the variance (versus two in the raw data), and six are needed to reach 75 % (versus three). In the 2D PCA plot, Treatments A and B separate more distinctly along PC1 and PC2 than in the raw analysis, indicating that normalization removed technical variance that had dominated PC1 in the raw data. The 3D view (PC1–PC3) shows the two treatment groups forming almost non-overlapping clusters, with PC3 capturing residual batch noise.

Interpretation (Top 1000, Normalized):
The normalized scree plot indicates that four PCs are needed to reach 50 % of the variance and seven to reach 75 %. Compared to the raw top 1000 data (which needed two and four), more PCs are required after normalization, because z-scoring prevents a handful of high-variance genes from dominating and spreads the variance more evenly across components. In the 2D PCA plot, the treatment clusters are more compact and outliers no longer dominate PC1. The 3D plot shows two tight, non-overlapping clusters by treatment.

Conclusion:
The z-score–normalized PCA spreads variance across more components, so more PCs are needed to explain the same cumulative percentage, but it yields tighter, more interpretable clusters by treatment. Normalization therefore uncovers biological variation by removing gene-level scale differences and down-weighting technical noise. The normalized top 100 gene PCA produces the clearest separation of treatment groups, so it is considered the “best” version for this dataset.

Question 3: Hierarchical Clustering

Clustering Functions

get_dist <- function(data, method = "euclidean") {
  if (method == "spearman") {
    # cor() correlates columns, so cor(data) yields a sample × sample matrix;
    # 1 - correlation converts it into a dissimilarity
    as.dist(1 - cor(data, method = "spearman"))
  } else {
    # dist() works on rows, so transpose to put samples in rows
    dist(t(data), method = method)
  }
}

run_clustering <- function(data, dist_method, clust_method, label) {
  d  <- get_dist(data, dist_method)
  hc <- hclust(d, method = clust_method)
  plot(
    hc,
    main = paste("Clustering:", label),
    xlab = "", sub = "",
    cex = 0.8
  )
}

Clustering Top Variable Genes

# We have already defined top100_genes, top1000_genes, and expr above.
top100   <- expr[top100_genes, ]
top1000  <- expr[top1000_genes, ]
all_genes <- expr  # no filtering

dist_methods  <- c("euclidean", "spearman")
clust_methods <- c("ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median", "centroid")

for (subset_label in c("Top100", "Top1000", "All")) {
  data_mat <- switch(
    subset_label,
    "Top100"  = top100,
    "Top1000" = top1000,
    "All"     = all_genes
  )
  
  for (dist_m in dist_methods) {
    for (clust_m in clust_methods) {
      lab <- paste(subset_label, "-", dist_m, "+", clust_m)
      run_clustering(data_mat, dist_m, clust_m, lab)
    }
  }
}
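
To back the visual reading of the dendrograms discussed below with numbers, each tree can be cut into two clusters and cross-tabulated against treatment. A sketch for the combination the discussion focuses on (Spearman + ward.D2 on the top 100 genes), assuming the same treatment:ch1 column and sample order used in Question 2:

# Sketch: quantify treatment separation for Spearman + ward.D2 on the top 100 genes
hc_check <- hclust(get_dist(top100, "spearman"), method = "ward.D2")
table(cluster = cutree(hc_check, k = 2), treatment = info$`treatment:ch1`)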

Top 100 Genes
When using Spearman + ward.D2, samples split cleanly into two major clades matching the two treatment groups. This mirrors the separation seen in the normalized PCA. Euclidean + ward.D2 yields a similar split, although one outlier sample occasionally clusters with the opposite treatment group. Single linkage (with either Euclidean or Spearman distance) produces a “chaining” effect that mixes treatment labels and fails to separate the groups. Complete linkage yields more compact clusters but occasionally places an outlier between them. Average, mcquitty, median, and centroid linkages produce intermediate behavior: average linkage groups most Treatment A samples together, but a couple of Treatment B samples join prematurely. ward.D performs similarly to ward.D2 but, because it does not square the dissimilarities, it is slightly less sensitive to large distances; it still separates treatments well, though it sometimes merges smaller subclusters before grouping by treatment.

Top 1000 Genes
Spearman + ward.D2 still separates treatments, but the separation is less clean than with Top 100, as one Treatment B sample falls on the Treatment A branch near the root, indicating that including lower-variance genes adds noise. Euclidean + ward.D2 mixes treatments more than Top 100, and cluster height differences between groups shrink. Single linkage chains are longer, reflecting nearest-neighbor merging, and treatments no longer clearly separate. Average, mcquitty, median, and centroid linkages all produce less distinct separation compared to Top 100, reflecting a dilution of treatment signal by noise.

All Genes
Spearman + ward.D2 fails to produce clean treatment clades; the cluster heights separating treatment groups are small. Euclidean + ward.D2 yields even weaker separation, with samples clustering without regard to treatment. Single linkage chains the samples entirely and mixes the labels. Complete linkage also fails to separate treatments, as the large clusters contain both labels. Average, mcquitty, median, and centroid linkages likewise provide minimal grouping by treatment, indicating that the mass of low-variance genes dominates the clustering.

Comparison to PCA (Question 2)
The clustering methods that leveraged Spearman distance and Ward’s linkage produced dendrograms most consistent with PCA’s separation of treatment groups. Euclidean distance sometimes captured batch or technical outliers, mirroring raw PCA. Single linkage never matched PCA (it chains samples rather than forming clear groups). As more genes are added, clustering quality (in terms of treatment separation) degrades, just as PCA separation became weaker when moving from Top 100 to Top 1000 to all genes. Therefore, Spearman + ward.D2 on Top 100 genes most closely recapitulates the PCA findings and yields the clearest separation by treatment.


Question 4: Stability Across Exponentially Increasing Slices

Order Genes by Variance and Define Slice Sizes

# Compute variance of each gene across all samples 
vars_all <- apply(expr, 1, var)

# Order gene IDs by descending variance
gene_order <- names(sort(vars_all, decreasing = TRUE))

# Determine maximum number of genes
max_genes <- length(gene_order)

# Generate exponents for slices
exponents <- seq(from = 2, to = floor(log2(max_genes)), by = 2)
slice_sizes <- unique(pmin(2^exponents, max_genes))

# Display the chosen slice sizes
slice_sizes
## [1]     4    16    64   256  1024  4096 16384

Compute Clustering Assignments for Each Slice

# Choose the best method determined above
best_dist <- "spearman"
best_clust <- "ward.D2"

# Retrieve sample names and initialize a list to hold cluster assignments
samples <- colnames(expr)
n_samples <- length(samples)
cluster_assignments <- list()

for (s in slice_sizes) {
  # Subset top s genes
  subset_genes <- gene_order[1:s]
  subset_expr <- expr[subset_genes, ]
  
  # Compute distance matrix and hierarchical clustering
  d <- get_dist(subset_expr, best_dist)
  hc <- hclust(d, method = best_clust)
  
  # Cut tree into k = 2 clusters
  clusters <- cutree(hc, k = 2)
  # If any NA in clusters, skip
  if (any(is.na(clusters))) {
    message("Slice size ", s, " produced NA cluster assignments. Skipping.")
    next
  }
  # Save cluster assignment vector
  cluster_assignments[[as.character(s)]] <- clusters
}
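
For a quick look at how the k = 2 assignments shift as the slice grows, the list can be stacked into a matrix (sketch; rows are slice sizes, columns are samples):

# Sketch: one row per slice size, one column per sample
assign_mat <- do.call(rbind, cluster_assignments)
assign_mat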

Build a Co-occurrence Matrix Across All Slices

# Initialize an n_samples × n_samples matrix of zeros
co_occur <- matrix(0, nrow = n_samples, ncol = n_samples,
                   dimnames = list(samples, samples))

# For each slice, increment co-occurrence when two samples are placed in the same cluster
for (clusters in cluster_assignments) {
  # skip slices where clustering failed or produced NAs
  if (any(is.na(clusters))) next
  for (i in 1:n_samples) {
    for (j in i:n_samples) {
      if (!is.na(clusters[i]) && !is.na(clusters[j]) && clusters[i] == clusters[j]) {
        co_occur[i, j] <- co_occur[i, j] + 1
        if (i != j) {
          co_occur[j, i] <- co_occur[j, i] + 1
        }
      }
    }
  }
}

co_occur
##            GSM1550559 GSM1550560 GSM1550561 GSM1550562 GSM1550563 GSM1550564
## GSM1550559          7          2          7          0          4          0
## GSM1550560          2          7          2          5          5          5
## GSM1550561          7          2          7          0          4          0
## GSM1550562          0          5          0          7          3          7
## GSM1550563          4          5          4          3          7          3
## GSM1550564          0          5          0          7          3          7
## GSM1550565          7          2          7          0          4          0
## GSM1550566          5          4          5          2          4          2
## GSM1550567          7          2          7          0          4          0
## GSM1550568          4          3          4          3          3          3
## GSM1550569          5          2          5          2          2          2
## GSM1550570          4          3          4          3          3          3
##            GSM1550565 GSM1550566 GSM1550567 GSM1550568 GSM1550569 GSM1550570
## GSM1550559          7          5          7          4          5          4
## GSM1550560          2          4          2          3          2          3
## GSM1550561          7          5          7          4          5          4
## GSM1550562          0          2          0          3          2          3
## GSM1550563          4          4          4          3          2          3
## GSM1550564          0          2          0          3          2          3
## GSM1550565          7          5          7          4          5          4
## GSM1550566          5          7          5          6          5          6
## GSM1550567          7          5          7          4          5          4
## GSM1550568          4          6          4          7          6          7
## GSM1550569          5          5          5          6          7          6
## GSM1550570          4          6          4          7          6          7
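
The nested loops above are easy to follow for 12 samples. As a sketch, the same co-occurrence matrix could also be built without explicit loops by comparing each assignment vector against itself with outer():

# Sketch: vectorized equivalent of the co-occurrence accumulation
co_occur_vec <- Reduce(
  `+`,
  lapply(cluster_assignments, function(cl) outer(cl, cl, `==`) * 1)
)
# all.equal(co_occur, co_occur_vec)  # should agree with the loop version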
L <- length(cluster_assignments)
# Logical matrix: TRUE if two samples fall in the same cluster in at least half of the slices
majority_together <- co_occur >= (L / 2)
majority_together
##            GSM1550559 GSM1550560 GSM1550561 GSM1550562 GSM1550563 GSM1550564
## GSM1550559       TRUE      FALSE       TRUE      FALSE       TRUE      FALSE
## GSM1550560      FALSE       TRUE      FALSE       TRUE       TRUE       TRUE
## GSM1550561       TRUE      FALSE       TRUE      FALSE       TRUE      FALSE
## GSM1550562      FALSE       TRUE      FALSE       TRUE      FALSE       TRUE
## GSM1550563       TRUE       TRUE       TRUE      FALSE       TRUE      FALSE
## GSM1550564      FALSE       TRUE      FALSE       TRUE      FALSE       TRUE
## GSM1550565       TRUE      FALSE       TRUE      FALSE       TRUE      FALSE
## GSM1550566       TRUE       TRUE       TRUE      FALSE       TRUE      FALSE
## GSM1550567       TRUE      FALSE       TRUE      FALSE       TRUE      FALSE
## GSM1550568       TRUE      FALSE       TRUE      FALSE      FALSE      FALSE
## GSM1550569       TRUE      FALSE       TRUE      FALSE      FALSE      FALSE
## GSM1550570       TRUE      FALSE       TRUE      FALSE      FALSE      FALSE
##            GSM1550565 GSM1550566 GSM1550567 GSM1550568 GSM1550569 GSM1550570
## GSM1550559       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE
## GSM1550560      FALSE       TRUE      FALSE      FALSE      FALSE      FALSE
## GSM1550561       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE
## GSM1550562      FALSE      FALSE      FALSE      FALSE      FALSE      FALSE
## GSM1550563       TRUE       TRUE       TRUE      FALSE      FALSE      FALSE
## GSM1550564      FALSE      FALSE      FALSE      FALSE      FALSE      FALSE
## GSM1550565       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE
## GSM1550566       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE
## GSM1550567       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE
## GSM1550568       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE
## GSM1550569       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE
## GSM1550570       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE

Score Each Slice by Agreement with Majority Co-occurrence

# Initialize a numeric vector
slice_scores <- numeric(length(cluster_assignments))
names(slice_scores) <- names(cluster_assignments)

# For each slice, we build a binary co-occurrence matrix and compare to majority_together
for (idx in seq_along(cluster_assignments)) {
  s <- names(cluster_assignments)[idx]
  clusters <- cluster_assignments[[s]]
  
  # Build a binary co-occurrence matrix for this slice
  slice_co <- matrix(0, n_samples, n_samples)
  dimnames(slice_co) <- dimnames(co_occur)
  
  for (i in 1:n_samples) {
    for (j in i:n_samples) {
      if (!is.na(clusters[i]) && !is.na(clusters[j]) && clusters[i] == clusters[j]) {
        slice_co[i, j] <- 1
        if (i != j) {
          slice_co[j, i] <- 1
        }
      }
    }
  }
  
  # Count how many entries match the majority co-occurrence
  slice_scores[idx] <- sum((slice_co == 1 & majority_together) | (slice_co == 0 & !majority_together))
}

slice_scores
##     4    16    64   256  1024  4096 16384 
##   120    86    90   128   134   128    82
# Identify which slice size yields the highest stability score
best_slice <- as.numeric(names(slice_scores)[which.max(slice_scores)])
cat("Best slice size (most stable):", best_slice, "\n")
## Best slice size (most stable): 1024
# Re-cluster that best slice and plot its dendrogram
subset_genes_best <- gene_order[1:best_slice]
d_best <- get_dist(expr[subset_genes_best, ], best_dist)
hc_best <- hclust(d_best, method = best_clust)

plot(
  hc_best,
  main = paste("Most Stable Dendrogram: Top", best_slice, "Genes"),
  xlab = "",
  sub = "",
  cex = 0.8
)
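
As a cross-check on the majority-agreement score, if the mclust package is available, the adjusted Rand index could be used to compare each slice's k = 2 assignment with the assignment from the most stable slice (a sketch, not part of the scoring above):

# Sketch: adjusted Rand index of each slice vs. the most stable slice (requires mclust)
best_assign <- cluster_assignments[[as.character(best_slice)]]
sapply(cluster_assignments, function(cl) mclust::adjustedRandIndex(cl, best_assign))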

Interpretation:
Among the slices considered (4, 16, 64, 256, 1024, 4096, and 16384 genes), the slice of 1024 genes produced the dendrogram whose cluster assignments agree most often with the “majority co-occurrence” pattern across all slices. In other words, clustering on the top 1024 most variable genes yields a k = 2 cut that best represents the dominant grouping seen across all levels of resolution. The top 1024 genes likely strike a balance: enough genes to carry the treatment signal, but not so many lower-variance genes that noise is introduced.