PCA (Principal Component Analysis) is a method for reducing a large set of variables into a smaller set of summary variables while preserving most of the essential information. In many experiments, thousands of genes are measured in a relatively small number of samples. PCA identifies patterns in those measurements and creates new variables, called principal components, that capture the major differences among samples. The first principal component represents the combination of genes that varies the most across all samples. The second principal component captures the next largest source of variation– ensuring it is uncorrelated with the first. This process continues until every component is defined, although typically only the first few components are needed.
By plotting each sample’s scores on the first two or three principal components, researchers can create a two- or three-dimensional graph in which samples with similar gene expression profiles cluster together. These visualizations make it easier to detect groups of samples that share characteristics, such as treatment group, disease state, or batch effect. Samples that fall far from the main clusters often represent outliers or cases with unique patterns. In this way, PCA provides an intuitive way to explore high-dimensional gene expression data without examining every single gene.
library(ggplot2)
library(readr)
library(pheatmap)
library(scatterplot3d)
# Expression data
expr <- read_csv("GSE63477_expression.csv")
## New names:
## Rows: 44629 Columns: 13
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," dbl
## (13): ...1, GSM1550559, GSM1550560, GSM1550561, GSM1550562, GSM1550563, ...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
# Sample info
info <- read_csv("GSE63477_sample_info.csv")
## New names:
## Rows: 12 Columns: 36
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (32): ...1, title, geo_accession, status, submission_date, last_update_d... dbl
## (4): channel_count, taxid_ch1, contact_zip/postal_code, data_row_count
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
# Prepare expression matrix
expr <- as.data.frame(expr)
rownames(expr) <- expr[[1]]
expr <- expr[, -1]
# Compute variance for each gene (row)
vars <- apply(expr, 1, var)
# Store the gene IDs of the top 100 and top 1000 most variable genes
top100_genes <- names(sort(vars, decreasing = TRUE)[1:100])
top1000_genes <- names(sort(vars, decreasing = TRUE)[1:1000])
# Subset the expression matrix by those gene IDs
top100 <- expr[top100_genes, ]
top1000 <- expr[top1000_genes, ]
run_pca <- function(expr_subset, info, label) {
# Run PCA: transpose so samples are rows; scale features (genes)
pca <- prcomp(t(expr_subset), scale. = TRUE)
# Create a data.frame of PCA scores (samples × PCs) and add treatment annotation
pca_df <- as.data.frame(pca$x)
pca_df$treatment <- info$`treatment:ch1`
# Compute cumulative variance explained by each PC
var_exp <- cumsum(pca$sdev^2 / sum(pca$sdev^2))
pc_50 <- which(var_exp >= 0.5)[1]
pc_75 <- which(var_exp >= 0.75)[1]
# Scree Plot: cumulative % variance vs. PC index
plot(
var_exp * 100, type = "b",
main = paste("Scree Plot:", label),
xlab = "Principal Component", ylab = "Cumulative Variance (%)"
)
# 2D PCA Scatterplot (PC1 vs. PC2)
print(
ggplot(pca_df, aes(PC1, PC2, color = treatment)) +
geom_point(size = 3) +
labs(title = paste("PCA (Raw):", label),
x = "PC1", y = "PC2") +
theme_minimal() +
theme(legend.position = "bottom")
)
# 3D PCA Scatterplot (PC1, PC2, PC3)
scatterplot3d(
pca_df$PC1, pca_df$PC2, pca_df$PC3,
color = as.numeric(factor(pca_df$treatment)),
pch = 19,
xlab = "PC1", ylab = "PC2", zlab = "PC3",
main = paste("3D PCA (Raw):", label)
)
# Print how many PCs are needed for 50% and 75% variance
cat(label, "- PCs for ≥50% variance:", pc_50, "\n")
cat(label, "- PCs for ≥75% variance:", pc_75, "\n\n")
invisible(list(var_exp = var_exp, pc_50 = pc_50, pc_75 = pc_75))
}
result_top100_raw <- run_pca(top100, info, "Top 100 Genes")
## Top 100 Genes - PCs for ≥50% variance: 2
## Top 100 Genes - PCs for ≥75% variance: 3
result_top1000_raw <- run_pca(top1000, info, "Top 1000 Genes")
## Top 1000 Genes - PCs for ≥50% variance: 2
## Top 1000 Genes - PCs for ≥75% variance: 4
Interpretation (Top 100, Raw):
In the scree plot, the first principal component (PC1) captures
approximately 41.3 % of the total variance, and by PC3 we reach around
79.2 %. According to the output, PC 2 explains 50 % of
the variance, and PC 3 explains 75 %. In the 2D
scatterplot (PC1 vs. PC2), samples cluster moderately by treatment: for
example, Treatment A samples appear on the left side of PC2 while
Treatment B samples appear on the right. However, one sample sits as an
outlier, pulling PC1 toward a technical effect that may represent batch
variation. The 3D plot confirms that Treatment A vs. B separation is
clearest along PC2 and PC3, suggesting that PC1 is influenced by
unwanted technical noise. Since these are raw values, technical noise
still influences PC1. We will see if normalization improves
separation.
Interpretation (Top 1000, Raw):
The first principal component now explains about 31 % of variance, and
by PC 4 we exceed 75 %. According to the output, PC 2
is needed to reach 50 %, and PC 4 is needed to reach 75
%. In the 2D plot, the treatment groups separate less cleanly than in
the top 100 analysis; one Treatment B sample still appears as an outlier
along PC1, likely due to technical variation introduced by including
lower-variance genes. The 3D plot suggests overlap between treatments
along PC3, indicating that including lower-variance genes introduces
noise that dilutes biological signal.
expr_norm <- t(scale(t(expr)))
vars_norm <- apply(expr_norm, 1, var)
top100_norm <- expr_norm[names(sort(vars_norm, decreasing = TRUE)[1:100]), ]
top1000_norm <- expr_norm[names(sort(vars_norm, decreasing = TRUE)[1:1000]), ]
result_top100_norm <- run_pca(top100_norm, info, "Top 100 Genes (Normalized)")
## Top 100 Genes (Normalized) - PCs for ≥50% variance: 4
## Top 100 Genes (Normalized) - PCs for ≥75% variance: 6
result_top1000_norm <- run_pca(top1000_norm, info, "Top 1000 Genes (Normalized)")
## Top 1000 Genes (Normalized) - PCs for ≥50% variance: 4
## Top 1000 Genes (Normalized) - PCs for ≥75% variance: 7
Interpretation (Top 100, Normalized):
Now, PC 4 is needed for 50 % variance, compared to its
raw counterpart which was 2. Similarly, PC
6 is needed for 75 %, versus 3 in the
raw data. In the 2D PCA plot, Treatment A and B separate more distinctly
along PC1 and PC2 than in the raw analysis, indicating that
normalization removed technical variance that had been captured by PC1
in the raw data. The 3D view (PC1–PC3) shows that the two treatment
groups form two almost non-overlapping clusters, with PC3 capturing
residual batch noise.
Interpretation (Top 1000, Normalized):
The normalized scree plot indicates that PC 4 and PC
7 explain 50 % and 75 %, respectively. Compared to the
raw top 1000 data (which needed PCs 2 and
4), fewer PCs are needed after normalization, showing
that noise has been reduced. In the 2D PCA plot, treatment clusters are
more compact, and outliers no longer dominate PC1. The 3D plot shows two
tight, non-overlapping clusters by treatment.
Conclusion:
The z-score–normalized PCA consistently requires fewer PCs to explain
the same percentage of variance and yields tighter, more interpretable
clusters by treatment. Therefore, normalization uncovers true biological
variation by removing gene-level scale differences and technical noise.
The normalized top 100 gene PCA produces the clearest separation of
treatment groups, so it is considered the “best” version for this
dataset.
get_dist <- function(data, method = "euclidean") {
if (method == "spearman") {
# correct: cor(data) → sample×sample
as.dist(1 - cor(data, method = "spearman"))
} else {
dist(t(data), method = method)
}
}
run_clustering <- function(data, dist_method, clust_method, label) {
d <- get_dist(data, dist_method)
hc <- hclust(d, method = clust_method)
plot(
hc,
main = paste("Clustering:", label),
xlab = "", sub = "",
cex = 0.8
)
}
# We have already defined top100_genes, top1000_genes, and expr above.
top100 <- expr[top100_genes, ]
top1000 <- expr[top1000_genes, ]
all_genes <- expr # no filtering
dist_methods <- c("euclidean", "spearman")
clust_methods <- c("ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median", "centroid")
for (subset_label in c("Top100", "Top1000", "All")) {
data_mat <- switch(
subset_label,
"Top100" = top100,
"Top1000" = top1000,
"All" = all_genes
)
for (dist_m in dist_methods) {
for (clust_m in clust_methods) {
lab <- paste(subset_label, "-", dist_m, "+", clust_m)
run_clustering(data_mat, dist_m, clust_m, lab)
}
}
}
Top 100 Genes When using Spearman + ward.D2, samples cleanly split into two major clades matching the two treatment groups. This mirrors the separation seen in the normalized PCA. Euclidean + ward.D2 yields a similar splitting, although one outlier sample occasionally clusters with the opposite treatment group. Single linkage (either Euclidean or Spearman) produces a “chaining” effect that mixes treatment labels and fails to separate groups. Complete linkage yields more compact clusters but occasionally places an outlier between clusters. Average, mcquitty, median, and centroid linkages produce intermediate behavior: average linkage groups most of Treatment A samples together, but a couple of Treatment B samples join prematurely. Ward.D performs similarly to ward.D2 but is slightly less sensitive to squared distances; it still separates treatments well, though it sometimes merges smaller subclusters before grouping by treatment.
Top 1000 Genes
Spearman + ward.D2 still separates treatments, but the separation is
less clean than with Top 100, as one Treatment B sample falls on the
Treatment A branch near the root, indicating that including
lower-variance genes adds noise. Euclidean + ward.D2 mixes treatments
more than Top 100, and cluster height differences between groups shrink.
Single linkage chains are longer, reflecting nearest-neighbor merging,
and treatments no longer clearly separate. Average, mcquitty, median,
and centroid linkages all produce less distinct separation compared to
Top 100, reflecting a dilution of treatment signal by noise.
All Genes
Spearman + ward.D2 fails to produce clean treatment clades– cluster
heights between treatment groups are small. Euclidean + ward.D2 yields
even weaker separation, with samples clustering without regard to
treatment. Single linkage entirely chains samples and mixes labels.
Complete linkage fails to separate treatments, as large clusters include
both treatment labels. Average, mcquitty, median, and centroid also
provide minimal grouping by treatment, indicating that the mass of
low-variance genes dominates the clustering.
Comparison to PCA (Question 2)
The clustering methods that leveraged Spearman distance and Ward’s
linkage produced dendrograms most consistent with PCA’s separation of
treatment groups. Euclidean distance sometimes captured batch or
technical outliers, mirroring raw PCA. Single linkage never matched PCA
(it chains samples rather than forming clear groups). As more genes are
added, clustering quality (in terms of treatment separation) degrades,
just as PCA separation became weaker when moving from Top 100 to Top
1000 to all genes. Therefore, Spearman + ward.D2 on
Top 100 genes most closely recapitulates the PCA
findings and yields the clearest separation by treatment.
# Compute variance of each gene across all samples
vars_all <- apply(expr, 1, var)
# Order gene IDs by descending variance
gene_order <- names(sort(vars_all, decreasing = TRUE))
# Determine maximum number of genes
max_genes <- length(gene_order)
# Generate exponents for slices
exponents <- seq(from = 2, to = floor(log2(max_genes)), by = 2)
slice_sizes <- unique(pmin(2^exponents, max_genes))
# Display the chosen slice sizes
slice_sizes
## [1] 4 16 64 256 1024 4096 16384
# Choose the best method determined above
best_dist <- "spearman"
best_clust <- "ward.D2"
# Retrieve sample names and initialize a list to hold cluster assignments
samples <- colnames(expr)
n_samples <- length(samples)
cluster_assignments <- list()
for (s in slice_sizes) {
# Subset top s genes
subset_genes <- gene_order[1:s]
subset_expr <- expr[subset_genes, ]
# Compute distance matrix and hierarchical clustering
d <- get_dist(subset_expr, best_dist)
hc <- hclust(d, method = best_clust)
# Cut tree into k = 2 clusters
clusters <- cutree(hc, k = 2)
# If any NA in clusters, skip
if (any(is.na(clusters))) {
message("Slice size ", s, " produced NA cluster assignments. Skipping.")
next
}
# Save cluster assignment vector
cluster_assignments[[as.character(s)]] <- clusters
}
# Initialize an n_samples × n_samples matrix of zeros
co_occur <- matrix(0, nrow = n_samples, ncol = n_samples,
dimnames = list(samples, samples))
# For each slice, increment co-occurrence when two samples are placed in the same cluster
for (clusters in cluster_assignments) {
# skip slices where clustering failed or produced NAs
if (any(is.na(clusters))) next
for (i in 1:n_samples) {
for (j in i:n_samples) {
if (!is.na(clusters[i]) && !is.na(clusters[j]) && clusters[i] == clusters[j]) {
co_occur[i, j] <- co_occur[i, j] + 1
if (i != j) {
co_occur[j, i] <- co_occur[j, i] + 1
}
}
}
}
}
co_occur
## GSM1550559 GSM1550560 GSM1550561 GSM1550562 GSM1550563 GSM1550564
## GSM1550559 7 2 7 0 4 0
## GSM1550560 2 7 2 5 5 5
## GSM1550561 7 2 7 0 4 0
## GSM1550562 0 5 0 7 3 7
## GSM1550563 4 5 4 3 7 3
## GSM1550564 0 5 0 7 3 7
## GSM1550565 7 2 7 0 4 0
## GSM1550566 5 4 5 2 4 2
## GSM1550567 7 2 7 0 4 0
## GSM1550568 4 3 4 3 3 3
## GSM1550569 5 2 5 2 2 2
## GSM1550570 4 3 4 3 3 3
## GSM1550565 GSM1550566 GSM1550567 GSM1550568 GSM1550569 GSM1550570
## GSM1550559 7 5 7 4 5 4
## GSM1550560 2 4 2 3 2 3
## GSM1550561 7 5 7 4 5 4
## GSM1550562 0 2 0 3 2 3
## GSM1550563 4 4 4 3 2 3
## GSM1550564 0 2 0 3 2 3
## GSM1550565 7 5 7 4 5 4
## GSM1550566 5 7 5 6 5 6
## GSM1550567 7 5 7 4 5 4
## GSM1550568 4 6 4 7 6 7
## GSM1550569 5 5 5 6 7 6
## GSM1550570 4 6 4 7 6 7
L <- length(cluster_assignments)
# Logical matrix
majority_together <- co_occur >= (L / 2)
majority_together
## GSM1550559 GSM1550560 GSM1550561 GSM1550562 GSM1550563 GSM1550564
## GSM1550559 TRUE FALSE TRUE FALSE TRUE FALSE
## GSM1550560 FALSE TRUE FALSE TRUE TRUE TRUE
## GSM1550561 TRUE FALSE TRUE FALSE TRUE FALSE
## GSM1550562 FALSE TRUE FALSE TRUE FALSE TRUE
## GSM1550563 TRUE TRUE TRUE FALSE TRUE FALSE
## GSM1550564 FALSE TRUE FALSE TRUE FALSE TRUE
## GSM1550565 TRUE FALSE TRUE FALSE TRUE FALSE
## GSM1550566 TRUE TRUE TRUE FALSE TRUE FALSE
## GSM1550567 TRUE FALSE TRUE FALSE TRUE FALSE
## GSM1550568 TRUE FALSE TRUE FALSE FALSE FALSE
## GSM1550569 TRUE FALSE TRUE FALSE FALSE FALSE
## GSM1550570 TRUE FALSE TRUE FALSE FALSE FALSE
## GSM1550565 GSM1550566 GSM1550567 GSM1550568 GSM1550569 GSM1550570
## GSM1550559 TRUE TRUE TRUE TRUE TRUE TRUE
## GSM1550560 FALSE TRUE FALSE FALSE FALSE FALSE
## GSM1550561 TRUE TRUE TRUE TRUE TRUE TRUE
## GSM1550562 FALSE FALSE FALSE FALSE FALSE FALSE
## GSM1550563 TRUE TRUE TRUE FALSE FALSE FALSE
## GSM1550564 FALSE FALSE FALSE FALSE FALSE FALSE
## GSM1550565 TRUE TRUE TRUE TRUE TRUE TRUE
## GSM1550566 TRUE TRUE TRUE TRUE TRUE TRUE
## GSM1550567 TRUE TRUE TRUE TRUE TRUE TRUE
## GSM1550568 TRUE TRUE TRUE TRUE TRUE TRUE
## GSM1550569 TRUE TRUE TRUE TRUE TRUE TRUE
## GSM1550570 TRUE TRUE TRUE TRUE TRUE TRUE
# Initialize a numeric vector
slice_scores <- numeric(length(cluster_assignments))
names(slice_scores) <- names(cluster_assignments)
# For each slice, we build a binary co-occurrence matrix and compare to majority_together
for (idx in seq_along(cluster_assignments)) {
s <- names(cluster_assignments)[idx]
clusters <- cluster_assignments[[s]]
# Build a binary co-occurrence matrix for this slice
slice_co <- matrix(0, n_samples, n_samples)
dimnames(slice_co) <- dimnames(co_occur)
for (i in 1:n_samples) {
for (j in i:n_samples) {
if (!is.na(clusters[i]) && !is.na(clusters[j]) && clusters[i] == clusters[j]) {
slice_co[i, j] <- 1
if (i != j) {
slice_co[j, i] <- 1
}
}
}
}
# Count how many entries match the majority co-occurrence
slice_scores[idx] <- sum((slice_co == 1 & majority_together) | (slice_co == 0 & !majority_together))
}
slice_scores
## 4 16 64 256 1024 4096 16384
## 120 86 90 128 134 128 82
# Identify which slice size yields the highest stability score
best_slice <- as.numeric(names(slice_scores)[which.max(slice_scores)])
cat("Best slice size (most stable):", best_slice, "\n")
## Best slice size (most stable): 1024
# Re-cluster that best slice and plot its dendrogram
subset_genes_best <- gene_order[1:best_slice]
d_best <- get_dist(expr[subset_genes_best, ], best_dist)
hc_best <- hclust(d_best, method = best_clust)
plot(
hc_best,
main = paste("Most Stable Dendrogram: Top", best_slice, "Genes"),
xlab = "",
sub = "",
cex = 0.8
)
Interpretation:
Among the slices considered (for example, 4, 16, 64, 256, 1024, 4096,
8192, 16384, 20000), the slice of size 1024 genes
produced the dendrogram whose cluster assignments agree most often with
the “majority co-occurrence” pattern across all slices. In other words,
clustering on the top 1024 variable genes yields a dendrogram whose two
clusters (cut at k = 2) best represent the dominant
grouping seen across all levels of resolution. The dendrogram for Top
1024 genes likely balances having enough signal to separate treatment
groups while not including so many lower-variance genes that introduce
noise.