Breast cancer is one of the most commonly diagnosed cancers worldwide and remains a major public health concern. Early detection plays a critical role in improving patient outcomes, as timely diagnosis significantly increases the chances of successful treatment. Because of this, it is a well-suited and meaningful domain for testing out whether unsupervised learning techniques like clustering and dimension reduction perform well.
# Install packages (uncomment if needed)
# install.packages("factoextra")
# install.packages("cluster")
# install.packages("fpc")
# install.packages("clustertend")
# install.packages("ClusterR")
# install.packages("flexclust")
# install.packages("NbClust")
# install.packages("tidyverse")
# install.packages("dendextend")
# install.packages("dbscan")
# install.packages("smacof")
# install.packages("ggfortify")
# install.packages("corrplot")
# install.packages("ape")
# install.packages("psych")
# Load required packages
library(factoextra)
library(psych)
library(corrplot)
library(smacof)
library(ggfortify)
library(cluster)
library(fpc)
library(clustertend)
library(ClusterR)
library(flexclust)
library(NbClust)
library(tidyverse)
library(dendextend)
library(dbscan)
library(ape)
The dataset chosen for this project can be found at: Dataset Link
Each observation in the dataset represents a breast cell sample described using the following features:
MALIG: Diagnostic indicator used only for interpretation, where −1 denotes benign tumors and 1 denotes malignant tumors.
Clump_Thick: Measures the thickness of cell clumps, with higher values indicating denser and more irregular cell groupings.
Uniform_Cell_Size: Assesses the consistency of cell sizes, where higher values reflect greater variability, often associated with malignancy.
Uniform_Cell_Shape: Evaluates uniformity in cell shape, with higher scores indicating irregular and abnormal cell morphology.
Marginal_Adhes: Represents how strongly cells adhere to one another, with lower adhesion commonly observed in malignant cells.
Epit_Size: Measures the size of epithelial cells, where larger values may indicate abnormal cell growth.
Bare_Nuclei: Counts the presence of bare nuclei, which are more frequently found in malignant samples.
Bland_Chrom: Describes the texture and uniformity of chromatin in the nucleus, with higher values indicating irregular chromatin patterns.
Norm_Nucleoli: Measures the prominence of nucleoli, which tend to be more visible and numerous in malignant cells.
Mitoses: Indicates the rate of cell division, where higher values suggest increased proliferative activity.
# Load the dataset
wbc <- read.csv("C:/Users/David Abraham/Desktop/Masters Studies/Unsupervised Learning/USL PROJECT 1/WBCdiag_02_withheader.csv")
# Inspect the dataset
str(wbc)
## 'data.frame': 683 obs. of 10 variables:
## $ MALIG : int -1 -1 -1 -1 -1 1 -1 -1 -1 -1 ...
## $ Clump_Thick : int 5 5 3 6 4 8 1 2 2 4 ...
## $ Uniform_Cell_Size : int 1 4 1 8 1 10 1 1 1 2 ...
## $ Uniform_Cell_Shape: int 1 4 1 8 1 10 1 2 1 1 ...
## $ Marginal_Adhes : int 1 5 1 1 3 8 1 1 1 1 ...
## $ Epit_Size : int 2 7 2 3 2 7 2 2 2 2 ...
## $ Bare_Nuclei : int 1 10 2 4 1 10 10 1 1 1 ...
## $ Bland_Chrom : int 3 3 3 3 3 9 3 3 1 2 ...
## $ Norm_Nucleoli : int 1 2 1 7 1 7 1 1 1 1 ...
## $ Mitoses : int 1 1 1 1 1 1 1 1 5 1 ...
summary(wbc)
## MALIG Clump_Thick Uniform_Cell_Size Uniform_Cell_Shape
## Min. :-1.0000 Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.:-1.0000 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000
## Median :-1.0000 Median : 4.000 Median : 1.000 Median : 1.000
## Mean :-0.3001 Mean : 4.442 Mean : 3.151 Mean : 3.215
## 3rd Qu.: 1.0000 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 5.000
## Max. : 1.0000 Max. :10.000 Max. :10.000 Max. :10.000
## Marginal_Adhes Epit_Size Bare_Nuclei Bland_Chrom
## Min. : 1.00 Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 1.00 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 2.000
## Median : 1.00 Median : 2.000 Median : 1.000 Median : 3.000
## Mean : 2.83 Mean : 3.234 Mean : 3.545 Mean : 3.445
## 3rd Qu.: 4.00 3rd Qu.: 4.000 3rd Qu.: 6.000 3rd Qu.: 5.000
## Max. :10.00 Max. :10.000 Max. :10.000 Max. :10.000
## Norm_Nucleoli Mitoses
## Min. : 1.00 Min. : 1.000
## 1st Qu.: 1.00 1st Qu.: 1.000
## Median : 1.00 Median : 1.000
## Mean : 2.87 Mean : 1.603
## 3rd Qu.: 4.00 3rd Qu.: 1.000
## Max. :10.00 Max. :10.000
# Check for missing values
colSums(is.na(wbc))
## MALIG Clump_Thick Uniform_Cell_Size Uniform_Cell_Shape
## 0 0 0 0
## Marginal_Adhes Epit_Size Bare_Nuclei Bland_Chrom
## 0 0 0 0
## Norm_Nucleoli Mitoses
## 0 0
# Store the true class labels for later validation
true_labels <- wbc$MALIG
# Check class distribution
table(true_labels)
## true_labels
## -1 1
## 444 239
Before we begin any clustering, we need to make sure our data is cleaned, normalized and we remove the MALIG column, this is because we want the clustering to be based on all the features that do not contain the complete answer about whether a data point is or is not malignant. We hope to gain this insight as a result of clustering rather so it is important that such columns are removed.
# Extract only the 9 clinical features (remove MALIG column)
wbc_features <- wbc[, 2:10]
# Check the features
head(wbc_features)
summary(wbc_features)
## Clump_Thick Uniform_Cell_Size Uniform_Cell_Shape Marginal_Adhes
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.00
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.00
## Median : 4.000 Median : 1.000 Median : 1.000 Median : 1.00
## Mean : 4.442 Mean : 3.151 Mean : 3.215 Mean : 2.83
## 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 5.000 3rd Qu.: 4.00
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.00
## Epit_Size Bare_Nuclei Bland_Chrom Norm_Nucleoli
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.00
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 1.00
## Median : 2.000 Median : 1.000 Median : 3.000 Median : 1.00
## Mean : 3.234 Mean : 3.545 Mean : 3.445 Mean : 2.87
## 3rd Qu.: 4.000 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 4.00
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.00
## Mitoses
## Min. : 1.000
## 1st Qu.: 1.000
## Median : 1.000
## Mean : 1.603
## 3rd Qu.: 1.000
## Max. :10.000
# Standardize the data (z-score normalization)
wbc_features_z <- as.data.frame(lapply(wbc_features, scale))
# Verify standardization
summary(wbc_features_z)
## Clump_Thick Uniform_Cell_Size Uniform_Cell_Shape Marginal_Adhes
## Min. :-1.2203 Min. :-0.7017 Min. :-0.7412 Min. :-0.6389
## 1st Qu.:-0.8658 1st Qu.:-0.7017 1st Qu.:-0.7412 1st Qu.:-0.6389
## Median :-0.1568 Median :-0.7017 Median :-0.7412 Median :-0.6389
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5523 3rd Qu.: 0.6033 3rd Qu.: 0.5972 3rd Qu.: 0.4084
## Max. : 1.9703 Max. : 2.2345 Max. : 2.2702 Max. : 2.5029
## Epit_Size Bare_Nuclei Bland_Chrom Norm_Nucleoli
## Min. :-1.0050 Min. :-0.6983 Min. :-0.9981 Min. :-0.6125
## 1st Qu.:-0.5552 1st Qu.:-0.6983 1st Qu.:-0.5899 1st Qu.:-0.6125
## Median :-0.5552 Median :-0.6983 Median :-0.1817 Median :-0.6125
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3444 3rd Qu.: 0.6738 3rd Qu.: 0.6347 3rd Qu.: 0.3703
## Max. : 3.0434 Max. : 1.7716 Max. : 2.6758 Max. : 2.3358
## Mitoses
## Min. :-0.3481
## 1st Qu.:-0.3481
## Median :-0.3481
## Mean : 0.0000
## 3rd Qu.:-0.3481
## Max. : 4.8461
# Hopkins statistic (values > 0.5 indicate clusterable data)
hopkins(wbc_features_z, n = nrow(wbc_features_z) - 1)
## $H
## [1] 0.2716578
# Visual assessment of clustering tendency
get_clust_tendency(wbc_features_z, n = 50, graph = TRUE,
gradient = list(low = "red", mid = "white", high = "blue"))
## $hopkins_stat
## [1] 0.7123601
##
## $plot
# Ordered Dissimilarity Matrix
dist_matrix <- dist(wbc_features_z, method = "euclidean")
fviz_dist(dist_matrix, show_labels = FALSE) +
labs(title = "Breast Cancer Data - Distance Matrix")
Inorder to identify what is the optimal number of clusters, I started by looking at elbow method before moving on to silhoutte method, gap statistic and Calinski Harabasz between a range of 2 to 15 clusters.
# Elbow method for k-means
fviz_nbclust(wbc_features_z, kmeans, method = "wss",k.max = 15) +
labs(title = "Elbow Method - K-means")
# Silhouette method for k-means
fviz_nbclust(wbc_features_z, kmeans, method = "silhouette",k.max = 15) +
labs(title = "Silhouette Method - K-means")
# Gap statistic
fviz_nbclust(wbc_features_z,
kmeans ,
k.max = 15,
method = "gap_stat")
# NbClust - multiple indices (Calinski-Harabasz)
nb <- NbClust(wbc_features_z, distance = "euclidean",
min.nc = 2, max.nc = 15, method = "kmeans",
index = "ch")
cat("\nNbClust recommendation (CH index):\n")
##
## NbClust recommendation (CH index):
print(nb$Best.nc)
## Number_clusters Value_Index
## 2.0000 853.4126
# Optimal clusters using ClusterR package
opt_km <- Optimal_Clusters_KMeans(wbc_features_z, max_clusters = 15,
plot_clusters = TRUE, criterion = "silhouette")
The above results show us that K=2 is worth trying out for visualizing if using the k-means clustering algorithm.
# ============================================
# K-MEANS CLUSTERING
# ============================================
# Perform k-means with 2 clusters (expecting benign/malignant)
set.seed(123)
km_result <- kmeans(wbc_features_z, centers = 2, nstart = 25)
# Cluster sizes
km_result$size
## [1] 453 230
# Cluster centers
km_result$centers
## Clump_Thick Uniform_Cell_Size Uniform_Cell_Shape Marginal_Adhes Epit_Size
## 1 -0.4971820 -0.6087925 -0.6045807 -0.5179091 -0.512503
## 2 0.9792324 1.1990564 1.1907612 1.0200557 1.009408
## Bare_Nuclei Bland_Chrom Norm_Nucleoli Mitoses
## 1 -0.5886887 -0.5484556 -0.5307637 -0.3073751
## 2 1.1594607 1.0802192 1.0453737 0.6053953
# Add cluster assignment to original data
wbc$cluster_km <- km_result$cluster
# Visualize clusters
fviz_cluster(km_result, data = wbc_features_z,
ellipse.type = "norm", geom = "point",
stand = FALSE, palette = "jco", ggtheme = theme_classic()) +
labs(title = "K-means Clustering (k=2)")
# Compare with true labels
table(Predicted = km_result$cluster, Actual = true_labels)
## Actual
## Predicted -1 1
## 1 434 19
## 2 10 220
# Silhouette plot
sil_km <- silhouette(km_result$cluster, dist(wbc_features_z))
fviz_silhouette(sil_km) + labs(title = "Silhouette Plot - K-means")
## cluster size ave.sil.width
## 1 1 453 0.75
## 2 2 230 0.23
# Calinski-Harabasz index
round(calinhara(wbc_features_z, km_result$cluster), digits = 2)
## [1] 853.41
The above graphs show that k=2 shows 2 different clusters without much seperation where one of the clusters is clearly a lot more compact compared to the other one which aligns with the sillhoutte width scores. Let us try using PAM next.
# Determine optimal k using silhouette method
fviz_nbclust(wbc_features_z, pam, method = "silhouette", k.max = 15) +
labs(title = "Optimal Number of Clusters - PAM (Silhouette Method)")
# Determine optimal k using elbow method
fviz_nbclust(wbc_features_z, pam, method = "wss", k.max = 15) +
labs(title = "Optimal Number of Clusters - PAM (Elbow Method)")
# Manual CH calculation for PAM
cat("\n=== Calinski-Harabasz Index (PAM) ===\n")
##
## === Calinski-Harabasz Index (PAM) ===
ch_scores_pam <- numeric(9)
for (k in 2:15) {
set.seed(123)
pam_temp <- pam(wbc_features_z, k = k)
ch_scores_pam[k-1] <- calinhara(wbc_features_z, pam_temp$clustering)
}
# Plot
plot(2:15, ch_scores_pam, type = "b",
xlab = "Number of Clusters (k)",
ylab = "Calinski-Harabasz Index",
main = "PAM - Calinski-Harabasz Index by K",
col = "purple", pch = 19, lwd = 2)
grid()
# Perform PAM with k=2
set.seed(123)
pam_k2 <- pam(wbc_features_z, k = 2)
cat("\n=== PAM with k=2 ===\n")
##
## === PAM with k=2 ===
cat("Average Silhouette Width:", pam_k2$silinfo$avg.width, "\n")
## Average Silhouette Width: 0.571736
cat("Cluster sizes:", pam_k2$clusinfo[, 1], "\n")
## Cluster sizes: 450 233
# Visualize k=2
fviz_cluster(pam_k2, geom = "point", ellipse.type = "norm") +
labs(title = "PAM Clustering (k = 2)")
labs(title = "Silhouette Plot - PAM (k = 2)")
## <ggplot2::labels> List of 1
## $ title: chr "Silhouette Plot - PAM (k = 2)"
fviz_silhouette(pam_k2) +
labs(title = "Silhouette Plot - PAM (k = 2)")
## cluster size ave.sil.width
## 1 1 450 0.75
## 2 2 233 0.22
# Perform PAM with k=3
set.seed(123)
pam_k3 <- pam(wbc_features_z, k = 3)
cat("\n=== PAM with k=3 ===\n")
##
## === PAM with k=3 ===
cat("Average Silhouette Width:", pam_k3$silinfo$avg.width, "\n")
## Average Silhouette Width: 0.5501258
cat("Cluster sizes:", pam_k3$clusinfo[, 1], "\n")
## Cluster sizes: 450 198 35
# Visualize k=3
fviz_cluster(pam_k3, geom = "point", ellipse.type = "norm") +
labs(title = "PAM Clustering (k = 3)")
fviz_silhouette(pam_k3) +
labs(title = "Silhouette Plot - PAM (k = 3)")
## cluster size ave.sil.width
## 1 1 450 0.73
## 2 2 198 0.18
## 3 3 35 0.26
PAM shows a little more better results, not only is it a more robust method because it is based on medoids, but we see a little more separation between the clusters.
# Optimal number of clusters for hierarchical
fviz_nbclust(wbc_features_z, FUN = hcut, method = "wss") +
labs(title = "Elbow Method - Hierarchical")
fviz_nbclust(wbc_features_z, FUN = hcut, method = "silhouette") +
labs(title = "Silhouette Method - Hierarchical")
# Compute distance matrix
dist_matrix <- dist(wbc_features_z, method = "euclidean")
# Agglomerative - Complete linkage
hc_complete <- hclust(dist_matrix, method = "complete")
plot(hc_complete, cex = 0.6, hang = -1,
main = "Dendrogram - Complete Linkage")
# Agglomerative - Ward's method
hc_ward <- hclust(dist_matrix, method = "ward.D2")
plot(hc_ward, cex = 0.6, hang = -1,
main = "Dendrogram - Ward's Method")
# Cut tree into 2 groups
hc_clusters <- cutree(hc_ward, k = 2)
# Number of members in each cluster
table(hc_clusters)
## hc_clusters
## 1 2
## 440 243
# Visualize cut
plot(hc_ward, cex = 0.6)
rect.hclust(hc_ward, k = 2, border = 2:3)
hc_clusters3 <- cutree(hc_ward, k = 3)
table(hc_clusters3)
## hc_clusters3
## 1 2 3
## 440 207 36
# Cluster plot
fviz_cluster(list(data = wbc_features_z, cluster = hc_clusters)) +
labs(title = "Hierarchical Clustering - Ward's Method k=2")
fviz_cluster(list(data = wbc_features_z, cluster = hc_clusters3)) +
labs(title = "Hierarchical Clustering - Ward's Method k=3")
# Compare with true labels
table(Predicted = hc_clusters, Actual = true_labels)
## Actual
## Predicted -1 1
## 1 431 9
## 2 13 230
# Compare different linkage methods
methods <- c("average", "single", "complete", "ward")
ac_values <- sapply(methods, function(m) {
agnes(wbc_features_z, method = m)$ac
})
cat("\nAgglomerative Coefficients:\n")
##
## Agglomerative Coefficients:
print(ac_values)
## average single complete ward
## 0.8896490 0.7915542 0.9241651 0.9906957
# AGNES (Agglomerative Nesting) - use "ward" not "ward.D2"
hc_agnes <- agnes(wbc_features_z, method = "ward")
pltree(hc_agnes, cex = 0.6, hang = -1,
main = "Dendrogram - AGNES")
cat("Agglomerative coefficient:", hc_agnes$ac, "\n")
## Agglomerative coefficient: 0.9906957
# AGNES (Agglomerative Nesting)
hc_agnes <- agnes(wbc_features_z, method = "ward")
pltree(hc_agnes, cex = 0.6, hang = -1,
main = "Dendrogram - AGNES")
cat("Agglomerative coefficient:", hc_agnes$ac, "\n")
## Agglomerative coefficient: 0.9906957
# DIANA (Divisive Analysis)
hc_diana <- diana(wbc_features_z)
pltree(hc_diana, cex = 0.6, hang = -1,
main = "Dendrogram - DIANA")
cat("Divisive coefficient:", hc_diana$dc, "\n")
## Divisive coefficient: 0.9171601
Hierarchical clustering yielded inferior results compared to PAM, with lower average silhouette widths indicating weaker cluster cohesion. The dendrograms across multiple linkage methods (Ward’s, Complete, Average) exhibited higher inter-cluster overlap, suggesting that hierarchical agglomeration struggles with the subtle boundaries between benign and malignant cases in this dataset.
# Determine optimal eps using k-NN distance plot
dbscan::kNNdistplot(wbc_features_z, k = 5)
abline(h = 1, lty = 2, col = "red") # Adjust this value based on plot
# Perform DBSCAN
set.seed(123)
db_result <- fpc::dbscan(wbc_features_z, eps = 1.5, MinPts = 8)
# Print results
print(db_result)
## dbscan Pts=683 MinPts=8 eps=1.5
## 0 1 2 3
## border 230 12 7 7
## seed 0 425 1 1
## total 230 437 8 8
# Plot DBSCAN results
plot(db_result, wbc_features_z, main = "DBSCAN", frame = FALSE)
# Visualize with factoextra
fviz_cluster(db_result, wbc_features_z, stand = FALSE,
frame = FALSE, geom = "point") +
labs(title = "DBSCAN Clustering")
# Check cluster membership (0 = noise/outlier)
table(db_result$cluster)
##
## 0 1 2 3
## 230 437 8 8
Even though, DBSCAN yields us the most seperated clusters, it also introduces a lot of noise and the number of clusters produces was very volatile to change in minPoints and eps values resulting in 1 giant cluster.
# Create a comprehensive summary for ALL points
cluster_summary <- data.frame(
Cluster = pam_k2$clustering,
MALIG = true_labels
)
# Count breakdown for each cluster
cat("=== Complete Breakdown - All Points ===\n\n")
## === Complete Breakdown - All Points ===
for (i in 1:2) {
cluster_points <- cluster_summary[cluster_summary$Cluster == i, ]
benign <- sum(cluster_points$MALIG == -1)
malignant <- sum(cluster_points$MALIG == 1)
total <- nrow(cluster_points)
cat("Cluster", i, ":\n")
cat(" Total points:", total, "\n")
cat(" Benign (-1):", benign, "\n")
cat(" Malignant (+1):", malignant, "\n")
cat(" % Benign:", round(benign/total*100, 2), "%\n")
cat(" % Malignant:", round(malignant/total*100, 2), "%\n\n")
}
## Cluster 1 :
## Total points: 450
## Benign (-1): 435
## Malignant (+1): 15
## % Benign: 96.67 %
## % Malignant: 3.33 %
##
## Cluster 2 :
## Total points: 233
## Benign (-1): 9
## Malignant (+1): 224
## % Benign: 3.86 %
## % Malignant: 96.14 %
# ============================================
# FEATURE VALUE RANGES BY CLUSTER
# ============================================
cat("\n=== Feature Value Ranges by Cluster ===\n\n")
##
## === Feature Value Ranges by Cluster ===
# Add cluster assignment to original data
wbc_with_cluster <- wbc
wbc_with_cluster$cluster <- pam_k2$clustering
for (i in 1:2) {
cat("CLUSTER", i, ":\n")
cluster_data <- wbc_with_cluster[wbc_with_cluster$cluster == i, 2:10]
# Calculate statistics for each feature
cat("Feature Min Q1 Median Q3 Max Mean\n")
cat("---------------------------------------------------------------\n")
for (feat in colnames(cluster_data)) {
stats <- summary(cluster_data[[feat]])
cat(sprintf("%-20s %4.1f %4.1f %4.1f %4.1f %4.1f %4.1f\n",
feat,
stats[1], stats[2], stats[3], stats[5], stats[6], stats[4]))
}
cat("\n")
}
## CLUSTER 1 :
## Feature Min Q1 Median Q3 Max Mean
## ---------------------------------------------------------------
## Clump_Thick 1.0 1.0 3.0 4.0 10.0 3.0
## Uniform_Cell_Size 1.0 1.0 1.0 1.0 5.0 1.3
## Uniform_Cell_Shape 1.0 1.0 1.0 1.0 8.0 1.4
## Marginal_Adhes 1.0 1.0 1.0 1.0 10.0 1.3
## Epit_Size 1.0 2.0 2.0 2.0 10.0 2.1
## Bare_Nuclei 1.0 1.0 1.0 1.0 10.0 1.3
## Bland_Chrom 1.0 1.0 2.0 3.0 7.0 2.1
## Norm_Nucleoli 1.0 1.0 1.0 1.0 9.0 1.2
## Mitoses 1.0 1.0 1.0 1.0 8.0 1.1
##
## CLUSTER 2 :
## Feature Min Q1 Median Q3 Max Mean
## ---------------------------------------------------------------
## Clump_Thick 1.0 5.0 8.0 10.0 10.0 7.2
## Uniform_Cell_Size 1.0 4.0 7.0 10.0 10.0 6.8
## Uniform_Cell_Shape 1.0 5.0 7.0 9.0 10.0 6.7
## Marginal_Adhes 1.0 3.0 5.0 8.0 10.0 5.7
## Epit_Size 2.0 4.0 5.0 7.0 10.0 5.5
## Bare_Nuclei 1.0 5.0 10.0 10.0 10.0 7.9
## Bland_Chrom 1.0 4.0 7.0 8.0 10.0 6.1
## Norm_Nucleoli 1.0 3.0 6.0 10.0 10.0 6.0
## Mitoses 1.0 1.0 1.0 3.0 10.0 2.6
# Compare means between clusters
cat("\n=== Mean Feature Values by Cluster ===\n")
##
## === Mean Feature Values by Cluster ===
cluster_means <- aggregate(wbc[, 2:10],
by = list(Cluster = pam_k2$clustering),
FUN = mean)
print(round(cluster_means, 2))
## Cluster Clump_Thick Uniform_Cell_Size Uniform_Cell_Shape Marginal_Adhes
## 1 1 3.03 1.28 1.41 1.34
## 2 2 7.18 6.76 6.70 5.70
## Epit_Size Bare_Nuclei Bland_Chrom Norm_Nucleoli Mitoses
## 1 2.08 1.30 2.08 1.24 1.09
## 2 5.45 7.87 6.08 6.01 2.59
# Difference in means (Cluster 1 - Cluster 2)
cat("\n=== Difference in Means (Cluster 1 - Cluster 2) ===\n")
##
## === Difference in Means (Cluster 1 - Cluster 2) ===
mean_diff <- cluster_means[1, -1] - cluster_means[2, -1]
mean_diff_sorted <- sort(abs(as.numeric(mean_diff)), decreasing = TRUE, index.return = TRUE)
cat("\nFeatures ranked by difference:\n")
##
## Features ranked by difference:
for (i in 1:9) {
feat_idx <- mean_diff_sorted$ix[i]
feat_name <- names(mean_diff)[feat_idx]
diff_val <- mean_diff[feat_idx]
cat(sprintf("%d. %-25s Difference: %6.2f\n", i, feat_name, diff_val))
}
## 1. Bare_Nuclei Difference: -6.57
## 2. Uniform_Cell_Size Difference: -5.47
## 3. Uniform_Cell_Shape Difference: -5.28
## 4. Norm_Nucleoli Difference: -4.77
## 5. Marginal_Adhes Difference: -4.36
## 6. Clump_Thick Difference: -4.15
## 7. Bland_Chrom Difference: -4.00
## 8. Epit_Size Difference: -3.37
## 9. Mitoses Difference: -1.49
# ============================================
# TYPICAL VALUE RANGES FOR CLASSIFICATION
# ============================================
cat("\n=== Typical Ranges for Benign vs Malignant ===\n\n")
##
## === Typical Ranges for Benign vs Malignant ===
# Identify which cluster is benign/malignant
cluster1_malig_pct <- sum(wbc_with_cluster$cluster == 1 & wbc_with_cluster$MALIG == 1) /
sum(wbc_with_cluster$cluster == 1)
if (cluster1_malig_pct > 0.5) {
malig_cluster <- 1
benign_cluster <- 2
} else {
malig_cluster <- 2
benign_cluster <- 1
}
cat("Benign Cluster:", benign_cluster, "\n")
## Benign Cluster: 1
cat("Malignant Cluster:", malig_cluster, "\n\n")
## Malignant Cluster: 2
# Show typical ranges
benign_data <- wbc_with_cluster[wbc_with_cluster$cluster == benign_cluster, 2:10]
malig_data <- wbc_with_cluster[wbc_with_cluster$cluster == malig_cluster, 2:10]
cat("BENIGN TUMORS - Typical Ranges (25th-75th percentile):\n")
## BENIGN TUMORS - Typical Ranges (25th-75th percentile):
cat("Feature Q1-Q3 Range Median\n")
## Feature Q1-Q3 Range Median
cat("----------------------------------------------------\n")
## ----------------------------------------------------
for (feat in colnames(benign_data)) {
q1 <- quantile(benign_data[[feat]], 0.25)
q3 <- quantile(benign_data[[feat]], 0.75)
med <- median(benign_data[[feat]])
cat(sprintf("%-20s %4.1f - %4.1f %4.1f\n", feat, q1, q3, med))
}
## Clump_Thick 1.0 - 4.0 3.0
## Uniform_Cell_Size 1.0 - 1.0 1.0
## Uniform_Cell_Shape 1.0 - 1.0 1.0
## Marginal_Adhes 1.0 - 1.0 1.0
## Epit_Size 2.0 - 2.0 2.0
## Bare_Nuclei 1.0 - 1.0 1.0
## Bland_Chrom 1.0 - 3.0 2.0
## Norm_Nucleoli 1.0 - 1.0 1.0
## Mitoses 1.0 - 1.0 1.0
cat("\nMALIGNANT TUMORS - Typical Ranges (25th-75th percentile):\n")
##
## MALIGNANT TUMORS - Typical Ranges (25th-75th percentile):
cat("Feature Q1-Q3 Range Median\n")
## Feature Q1-Q3 Range Median
cat("----------------------------------------------------\n")
## ----------------------------------------------------
for (feat in colnames(malig_data)) {
q1 <- quantile(malig_data[[feat]], 0.25)
q3 <- quantile(malig_data[[feat]], 0.75)
med <- median(malig_data[[feat]])
cat(sprintf("%-20s %4.1f - %4.1f %4.1f\n", feat, q1, q3, med))
}
## Clump_Thick 5.0 - 10.0 8.0
## Uniform_Cell_Size 4.0 - 10.0 7.0
## Uniform_Cell_Shape 5.0 - 9.0 7.0
## Marginal_Adhes 3.0 - 8.0 5.0
## Epit_Size 4.0 - 7.0 5.0
## Bare_Nuclei 5.0 - 10.0 10.0
## Bland_Chrom 4.0 - 8.0 7.0
## Norm_Nucleoli 3.0 - 10.0 6.0
## Mitoses 1.0 - 3.0 1.0
cat("=== Confusion Matrix ===\n")
## === Confusion Matrix ===
confusion_matrix <- table(Predicted = pam_k2$clustering, Actual = true_labels)
print(confusion_matrix)
## Actual
## Predicted -1 1
## 1 435 15
## 2 9 224
# ============================================
# VISUALIZATIONS
# ============================================
# Percentage stacked bar plot
cluster_pct <- prop.table(table(cluster_summary$Cluster, cluster_summary$MALIG), margin = 1) * 100
barplot(t(cluster_pct),
col = c("lightblue", "lightcoral"),
main = "Cluster Composition: Percentage Distribution",
xlab = "Cluster",
ylab = "Percentage (%)",
legend = c("Benign (-1)", "Malignant (+1)"),
args.legend = list(x = "topright", title = "MALIG Status"),
ylim = c(0, 100))
# Add percentage labels
for (i in 1:2) {
text(x = (i-1)*1.2 + 0.7, y = cluster_pct[i,1]/2,
labels = paste0(round(cluster_pct[i,1], 1), "%"),
col = "darkblue", font = 2, cex = 1.1)
text(x = (i-1)*1.2 + 0.7, y = cluster_pct[i,1] + cluster_pct[i,2]/2,
labels = paste0(round(cluster_pct[i,2], 1), "%"),
col = "darkred", font = 2, cex = 1.1)
}
The above graph show us that we can draw some meaningful insights. The data can be grouped into 2 clusters and this makes sense as the entire dataset contains features for aiding in breast cancer detection. On taking a deeper look at values of the MALIG column inside both clusters it is very evident that cluster 1 is predominantly Benign with a few exceptions while cluster 2 is Malignant. This is actually an amazing finding because without any labels the values of the remaining 9 dimensions were useful enough to make this classification highly accurate.
The Feature Value Ranges tables highlight the key differences between the two clusters which is aligned with cytological findings.
Uniformity of Cell Size and Shape: Cluster 1 shows median values of 1.0, indicating well-formed, uniform cells, whereas Cluster 2 has higher medians of 7.0 for size and 7.0 for shape, reflecting disorganized and irregular cell structures typical of malignancy.
Bare Nuclei: There’s a striking difference in Bare Nuclei, with Cluster 1 showing a low mean of 1.3, while Cluster 2 has a mean of 7.9, suggesting the presence of abnormal, absent nuclei, which is more common in malignant cases.
Mitoses: Mitoses are almost absent in Cluster 1 with a mean of 1.1, while Cluster 2 shows a significantly higher mean of 2.6, indicating more active cell division, a key indicator of malignancy.
# ============================================
# DIMENSION REDUCTION - MDS & PCA
# ============================================
# Prepare data
wbc_features_z <- as.data.frame(lapply(wbc[, 2:10], scale))
# ============================================
# CORRELATION ANALYSIS
# ============================================
# Check if variables are correlated (necessary for dimension reduction)
wbc_cor <- cor(wbc[, 2:10], method = "pearson")
print(round(wbc_cor, 2))
## Clump_Thick Uniform_Cell_Size Uniform_Cell_Shape
## Clump_Thick 1.00 0.64 0.65
## Uniform_Cell_Size 0.64 1.00 0.91
## Uniform_Cell_Shape 0.65 0.91 1.00
## Marginal_Adhes 0.49 0.71 0.69
## Epit_Size 0.52 0.75 0.72
## Bare_Nuclei 0.59 0.69 0.71
## Bland_Chrom 0.55 0.76 0.74
## Norm_Nucleoli 0.53 0.72 0.72
## Mitoses 0.35 0.46 0.44
## Marginal_Adhes Epit_Size Bare_Nuclei Bland_Chrom
## Clump_Thick 0.49 0.52 0.59 0.55
## Uniform_Cell_Size 0.71 0.75 0.69 0.76
## Uniform_Cell_Shape 0.69 0.72 0.71 0.74
## Marginal_Adhes 1.00 0.59 0.67 0.67
## Epit_Size 0.59 1.00 0.59 0.62
## Bare_Nuclei 0.67 0.59 1.00 0.68
## Bland_Chrom 0.67 0.62 0.68 1.00
## Norm_Nucleoli 0.60 0.63 0.58 0.67
## Mitoses 0.42 0.48 0.34 0.35
## Norm_Nucleoli Mitoses
## Clump_Thick 0.53 0.35
## Uniform_Cell_Size 0.72 0.46
## Uniform_Cell_Shape 0.72 0.44
## Marginal_Adhes 0.60 0.42
## Epit_Size 0.63 0.48
## Bare_Nuclei 0.58 0.34
## Bland_Chrom 0.67 0.35
## Norm_Nucleoli 1.00 0.43
## Mitoses 0.43 1.00
# Visualize correlation matrix
corrplot(wbc_cor)
The correlation matrix shows strong positive relationships among most features in the dataset, particularly between Uniform Cell Size, Uniform Cell Shape, Bare Nuclei, Bland Chromatin, and Normal Nucleoli, indicating that these variables capture closely related aspects of cellular abnormality. Clump Thickness and Epithelial Cell Size also exhibit moderate to strong correlations.
In contrast, Mitoses shows comparatively weaker correlations with other variables, suggesting that it represents a more independent source of variation. But overall this is a good sign that dimension reduction will be possible with the dataset!
# Standard PCA
pca_result <- prcomp(wbc_features_z, center = FALSE, scale. = FALSE)
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.4289 0.88088 0.73434 0.67796 0.61667 0.54943 0.54259
## Proportion of Variance 0.6555 0.08622 0.05992 0.05107 0.04225 0.03354 0.03271
## Cumulative Proportion 0.6555 0.74172 0.80163 0.85270 0.89496 0.92850 0.96121
## PC8 PC9
## Standard deviation 0.51062 0.29729
## Proportion of Variance 0.02897 0.00982
## Cumulative Proportion 0.99018 1.00000
# Scree plot - variance explained
fviz_eig(pca_result, addlabels = TRUE,
main = "Scree Plot - Variance Explained by Each PC")
# Check eigenvalues
eig_val <- get_eigenvalue(pca_result)
print(eig_val)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 5.89949935 65.5499928 65.54999
## Dim.2 0.77594689 8.6216321 74.17162
## Dim.3 0.53925224 5.9916916 80.16332
## Dim.4 0.45962745 5.1069717 85.27029
## Dim.5 0.38027583 4.2252870 89.49558
## Dim.6 0.30187645 3.3541828 92.84976
## Dim.7 0.29440271 3.2711413 96.12090
## Dim.8 0.26073586 2.8970651 99.01796
## Dim.9 0.08838322 0.9820358 100.00000
# How many PCs explain 80% of variance?
cumsum(eig_val$variance.percent)
## [1] 65.54999 74.17162 80.16332 85.27029 89.49558 92.84976 96.12090
## [8] 99.01796 100.00000
# Biplot - variables and observations
fviz_pca_biplot(pca_result,
geom.ind = "point",
col.ind = ifelse(true_labels == -1, "blue", "red"),
col.var = "black",
repel = TRUE,
title = "PCA Biplot - Breast Cancer Data")
# Variable contributions to PC1 and PC2
fviz_contrib(pca_result, choice = "var", axes = 1, top = 9,
title = "Contribution to PC1")
fviz_contrib(pca_result, choice = "var", axes = 2, top = 9,
title = "Contribution to PC2")
# Show loadings for first 2 PCs
cat("\nLoadings for PC1 and PC2:\n")
##
## Loadings for PC1 and PC2:
print(pca_result$rotation[, 1:2])
## PC1 PC2
## Clump_Thick -0.3020626 -0.14080053
## Uniform_Cell_Size -0.3807930 -0.04664031
## Uniform_Cell_Shape -0.3775825 -0.08242247
## Marginal_Adhes -0.3327236 -0.05209438
## Epit_Size -0.3362340 0.16440439
## Bare_Nuclei -0.3350675 -0.26126062
## Bland_Chrom -0.3457474 -0.22807676
## Norm_Nucleoli -0.3355914 0.03396582
## Mitoses -0.2302064 0.90555729
From the plot of the variables we can see the Mitoses is the only feature that is the quite isolated which was consistent with our correlation matrix.
According to Kaiser rule we only need to choose the first principal component since its the only component with eigenvalue > 1. But I want to aim to choose the least number of components that can explain ~75% of cumalitive variance in my dataset. Since PC1 can only explain 65.5% on its own, I believe we need to also include PC2.
# PCA with varimax rotation for easier interpretation
library(psych)
pca_rotated <- principal(wbc_features_z, nfactors = 2, rotate = "varimax")
print(pca_rotated)
## Principal Components Analysis
## Call: principal(r = wbc_features_z, nfactors = 2, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 h2 u2 com
## Clump_Thick 0.72 0.18 0.55 0.446 1.1
## Uniform_Cell_Size 0.87 0.33 0.86 0.143 1.3
## Uniform_Cell_Shape 0.87 0.30 0.85 0.154 1.2
## Marginal_Adhes 0.76 0.28 0.66 0.345 1.3
## Epit_Size 0.69 0.46 0.69 0.312 1.7
## Bare_Nuclei 0.84 0.11 0.72 0.285 1.0
## Bland_Chrom 0.85 0.15 0.75 0.254 1.1
## Norm_Nucleoli 0.74 0.35 0.67 0.335 1.4
## Mitoses 0.20 0.95 0.95 0.051 1.1
##
## RC1 RC2
## SS loadings 5.09 1.58
## Proportion Var 0.57 0.18
## Cumulative Var 0.57 0.74
## Proportion Explained 0.76 0.24
## Cumulative Proportion 0.76 1.00
##
## Mean item complexity = 1.2
## Test of the hypothesis that 2 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.05
## with the empirical chi square 123.18 with prob < 2.8e-17
##
## Fit based upon off diagonal values = 0.99
# Print significant loadings only
cat("\nSignificant Loadings (cutoff = 0.4):\n")
##
## Significant Loadings (cutoff = 0.4):
print(loadings(pca_rotated), digits = 3, cutoff = 0.4, sort = TRUE)
##
## Loadings:
## RC1 RC2
## Clump_Thick 0.723
## Uniform_Cell_Size 0.865
## Uniform_Cell_Shape 0.871
## Marginal_Adhes 0.760
## Epit_Size 0.692 0.457
## Bare_Nuclei 0.838
## Bland_Chrom 0.851
## Norm_Nucleoli 0.736
## Mitoses 0.954
##
## RC1 RC2
## SS loadings 5.095 1.581
## Proportion Var 0.566 0.176
## Cumulative Var 0.566 0.742
# Quality measures
cat("\nComplexity (lower is better):\n")
##
## Complexity (lower is better):
print(pca_rotated$complexity)
## Clump_Thick Uniform_Cell_Size Uniform_Cell_Shape Marginal_Adhes
## 1.119448 1.282949 1.229359 1.263172
## Epit_Size Bare_Nuclei Bland_Chrom Norm_Nucleoli
## 1.731604 1.035236 1.060832 1.431011
## Mitoses
## 1.085271
cat("\nUniqueness (lower is better):\n")
##
## Uniqueness (lower is better):
print(pca_rotated$uniqueness)
## Clump_Thick Uniform_Cell_Size Uniform_Cell_Shape Marginal_Adhes
## 0.44633609 0.14286526 0.15364542 0.34479029
## Epit_Size Bare_Nuclei Bland_Chrom Norm_Nucleoli
## 0.31206905 0.28469792 0.25440254 0.33469392
## Mitoses
## 0.05105328
# Plot complexity vs uniqueness
plot(pca_rotated$complexity, pca_rotated$uniqueness,
main = "Complexity vs Uniqueness",
xlab = "Complexity", ylab = "Uniqueness",
pch = 19, col = "steelblue")
text(pca_rotated$complexity, pca_rotated$uniqueness,
labels = names(pca_rotated$complexity), pos = 3, cex = 0.7)
abline(h = 0.5, v = 1.5, lty = 2, col = "red")
With the 2 rotated components, the loadings reveal a clear interpretative structure. RC1 captures general cellular abnormality, with high loadings across morphological features (Uniform_Cell_Shape 0.871, Uniform_Cell_Size 0.865, Bland_Chrom 0.851). RC2 on the other hand on cell division activity as it isolates mitotic activity (Mitoses 0.954).
print(round(wbc_cor, 2))
## Clump_Thick Uniform_Cell_Size Uniform_Cell_Shape
## Clump_Thick 1.00 0.64 0.65
## Uniform_Cell_Size 0.64 1.00 0.91
## Uniform_Cell_Shape 0.65 0.91 1.00
## Marginal_Adhes 0.49 0.71 0.69
## Epit_Size 0.52 0.75 0.72
## Bare_Nuclei 0.59 0.69 0.71
## Bland_Chrom 0.55 0.76 0.74
## Norm_Nucleoli 0.53 0.72 0.72
## Mitoses 0.35 0.46 0.44
## Marginal_Adhes Epit_Size Bare_Nuclei Bland_Chrom
## Clump_Thick 0.49 0.52 0.59 0.55
## Uniform_Cell_Size 0.71 0.75 0.69 0.76
## Uniform_Cell_Shape 0.69 0.72 0.71 0.74
## Marginal_Adhes 1.00 0.59 0.67 0.67
## Epit_Size 0.59 1.00 0.59 0.62
## Bare_Nuclei 0.67 0.59 1.00 0.68
## Bland_Chrom 0.67 0.62 0.68 1.00
## Norm_Nucleoli 0.60 0.63 0.58 0.67
## Mitoses 0.42 0.48 0.34 0.35
## Norm_Nucleoli Mitoses
## Clump_Thick 0.53 0.35
## Uniform_Cell_Size 0.72 0.46
## Uniform_Cell_Shape 0.72 0.44
## Marginal_Adhes 0.60 0.42
## Epit_Size 0.63 0.48
## Bare_Nuclei 0.58 0.34
## Bland_Chrom 0.67 0.35
## Norm_Nucleoli 1.00 0.43
## Mitoses 0.43 1.00
# Convert correlation matrix to distance matrix (dissimilarity)
library(smacof)
library(ape)
dis2 <- sim2diss(wbc_cor, method = 1, to.dist = TRUE)
# Check the distance matrix
print(dis2)
## Clump_Thick Uniform_Cell_Size Uniform_Cell_Shape
## Uniform_Cell_Size 0.35751851
## Uniform_Cell_Shape 0.34653001 0.09277177
## Marginal_Adhes 0.51217128 0.29302305 0.31405194
## Epit_Size 0.47640396 0.24645598 0.27753759
## Bare_Nuclei 0.40690856 0.30829125 0.28612245
## Bland_Chrom 0.44625755 0.24444084 0.26465650
## Norm_Nucleoli 0.46593409 0.28065396 0.28203659
## Mitoses 0.64904283 0.53924530 0.55874242
## Marginal_Adhes Epit_Size Bare_Nuclei Bland_Chrom
## Uniform_Cell_Size
## Uniform_Cell_Shape
## Marginal_Adhes
## Epit_Size 0.40545223
## Bare_Nuclei 0.32935171 0.41428387
## Bland_Chrom 0.33143294 0.38187210 0.31938514
## Norm_Nucleoli 0.39687894 0.37107360 0.41571980 0.33439847
## Mitoses 0.58110167 0.51941670 0.66078956 0.65398911
## Norm_Nucleoli
## Uniform_Cell_Size
## Uniform_Cell_Shape
## Marginal_Adhes
## Epit_Size
## Bare_Nuclei
## Bland_Chrom
## Norm_Nucleoli
## Mitoses 0.56624273
# Mantel test – for similarity of matrices
# H0 - the compared matrices are random, thus they are different
# H1 - the non-random pattern, interpreted as a similarity of the matrices
mantel_result <- mantel.test(as.matrix(wbc_cor), as.matrix(dis2))
print(mantel_result)
## $z.stat
## [1] 8.016593
##
## $p
## [1] 0.001
##
## $alternative
## [1] "two.sided"
# Perform MDS using the smacof package
fit.data <- mds(dis2, ndim = 2, type = "ratio")
# Print the MDS results summary
summary(fit.data)
##
## Configurations:
## D1 D2
## Clump_Thick -0.2568 0.8022
## Uniform_Cell_Size -0.0377 -0.0085
## Uniform_Cell_Shape -0.1165 0.0909
## Marginal_Adhes -0.2457 -0.6323
## Epit_Size 0.3465 0.2775
## Bare_Nuclei -0.6620 0.1468
## Bland_Chrom -0.4914 -0.2674
## Norm_Nucleoli 0.2690 -0.4493
## Mitoses 1.1945 0.0401
##
##
## Stress per point (in %):
## Clump_Thick Uniform_Cell_Size Uniform_Cell_Shape Marginal_Adhes
## 11.51 2.31 3.22 16.31
## Epit_Size Bare_Nuclei Bland_Chrom Norm_Nucleoli
## 13.57 11.08 11.19 15.23
## Mitoses
## 15.57
# Simple MDS plot
plot(fit.data, main = "MDS - Distance Matrix", xlab = "Dimension 1", ylab = "Dimension 2")
stress_values <- c()
# Loop over the number of dimensions (1 to 9)
for (k in 1:8) {
# Apply MDS with 'k' dimensions
fit_data <- mds(dis2, ndim = k, type = "ratio")
# Extract the stress value
stress_values <- c(stress_values, fit_data$stress)
# Print the stress for the current dimension
cat("Stress for", k, "dimensions:", fit_data$stress, "\n")
}
## Stress for 1 dimensions: 0.3997401
## Stress for 2 dimensions: 0.1834531
## Stress for 3 dimensions: 0.106796
## Stress for 4 dimensions: 0.06353847
## Stress for 5 dimensions: 0.04379878
## Stress for 6 dimensions: 0.02140813
## Stress for 7 dimensions: 0.006328878
## Stress for 8 dimensions: 5.676903e-16
# Plot stress vs. number of dimensions
plot(1:8, stress_values, type = "b",
xlab = "Number of Dimensions", ylab = "Stress",
main = "Stress vs. Number of Dimensions",
col = "blue", pch = 19)
# Generate random stress baseline
set.seed(123)
stressvec <- randomstress(n = 9, ndim = 2, nrep = 100)
random_stress_mean <- mean(stressvec)
cat("\n=== MDS Quality Assessment ===\n")
##
## === MDS Quality Assessment ===
cat("Random stress (baseline):", random_stress_mean, "\n")
## Random stress (baseline): 0.3115778
cat("Empirical stress (2D MDS):", fit.data$stress, "\n")
## Empirical stress (2D MDS): 0.1834531
# Quality ratio (lower is better)
stress_ratio <- fit.data$stress / random_stress_mean
cat("Stress ratio (empirical/random):", stress_ratio, "\n\n")
## Stress ratio (empirical/random): 0.5887873
# Interpretation
cat("Your 2D stress:", round(fit.data$stress, 3), "=",
ifelse(fit.data$stress > 0.20, "VERY POOR",
ifelse(fit.data$stress > 0.10, "POOR",
ifelse(fit.data$stress > 0.05, "FAIR",
ifelse(fit.data$stress > 0.025, "GOOD", "EXCELLENT")))), "\n")
## Your 2D stress: 0.183 = POOR
The Mantel test confirms valid similarity-to-dissimilarity conversion (z=8.02, p=0.001).
The stress ratio of 0.589 indicates that the 2D MDS configuration is 41% better than a random arrangement, falling in the moderate fit range. While this confirms the solution captures some meaningful distance structure rather than random noise, the moderate improvement (ratio > 0.5) suggests the 2D representation is not ideal.
MDS requires 4 or 5 dimensions to achieve ‘good’ stress values (<0.05) according to Kruskal’s benchmarks as the 2 dimension MDS solution yields a stress of 0.183 which is considered closer to ‘poor’ than ‘fair’.
This indicates that PCA might be the superior dimensionality reduction method for this dataset. With just 2 principal components, PCA retains 74.1% of the variance, meeting the ~75% threshold I was looking for.
library(ggfortify)
wbc$Class <- factor(ifelse(true_labels == -1, "Benign", "Malignant"))
autoplot(
pca_result,
data = wbc,
colour = "Class",
loadings = TRUE,
loadings.label = TRUE,
loadings.label.size = 3,
loadings.colour = "darkgreen"
) +
scale_colour_manual(values = c("Benign" = "blue", "Malignant" = "red")) +
labs(
title = "PCA - Breast Cancer Features with Loadings",
colour = "True Class"
)
wbc$Cluster <- factor(pam_k2$clustering,
labels = c("Cluster 1", "Cluster 2"))
autoplot(
pca_result,
data = wbc,
colour = "Cluster",
loadings = TRUE,
loadings.label = TRUE,
loadings.label.size = 3,
loadings.colour = "darkgreen"
) +
scale_colour_manual(values = c("Cluster 1" = "blue", "Cluster 2" = "red")) +
labs(
title = "PCA - Breast Cancer Features with Loadings (PAM Clusters)",
colour = "PAM Cluster"
)
The PCA biplots show that the first two components explain approximately 74% of the total variance, with a clear separation of observations along PC1. Variables related to cell size, shape, and nuclear features contribute most strongly to this separation, while mitoses primarily influence PC2.
Coloring by true labels and PAM clusters results in nearly identical groupings. This confirms that PCA preserves the information most relevant for clustering in a low-dimensional space.
pca_2d <- pca_result$x[, 1:2]
fviz_nbclust(pca_2d, pam, method = "silhouette", k.max = 15) +
labs(title = "Optimal Number of Clusters - PAM after PCA")
# PAM on PCA results (k=2)
pam_pca <- pam(pca_2d, k = 2)
# Visualize PAM on PCA
fviz_cluster(pam_pca, data = pca_2d,
ellipse.type = "norm", geom = "point",
palette = "jco") +
labs(title = "PAM (k=2) on PCA-reduced Data")
# Silhouette plot for PAM on PCA
fviz_silhouette(pam_pca) +
labs(title = "Silhouette Plot on PCA applied data")
## cluster size ave.sil.width
## 1 1 442 0.88
## 2 2 241 0.47
# Compare PAM with true labels
cat("\n=== PAM on PCA vs True Labels ===\n")
##
## === PAM on PCA vs True Labels ===
table(Predicted = pam_pca$clustering, Actual = true_labels)
## Actual
## Predicted -1 1
## 1 433 9
## 2 11 230
cat("\nPAM on PCA - Average Silhouette Width:",
round(pam_pca$silinfo$avg.width, 4), "\n")
##
## PAM on PCA - Average Silhouette Width: 0.7346
# Summary comparison
cat("\n=== CLUSTERING ON REDUCED DIMENSIONS SUMMARY ===\n")
##
## === CLUSTERING ON REDUCED DIMENSIONS SUMMARY ===
cat("PAM on PCA - Silhouette:",
round(pam_pca$silinfo$avg.width, 4), "\n")
## PAM on PCA - Silhouette: 0.7346
Running PAM on top of PCA based clustering resulted in slightly more
accurate results as the silhouette widths on both clusters have
increased compared to running clustering without dimension reduction.
The optimal number of clusters still being 2 compared to when clustering
was ran with all 9 dimensions show that we were able to effectively
reduce the number of dimensions from 9 to 2 which is quite
effective!
Lastly on comparing the classification results of clustering vs true
labels of the data points before and after dimension reduction, it is
safe to say that accuracy of the classification overall improved
marginally.
This analysis successfully demonstrated the efficacy of unsupervised learning techniques in discerning the underlying structure on breast cancer related features.
The clustering analysis effectively partitioned the data into two distinct groups, which strongly corresponded to the known benign and malignant classes. The high purity of these clusters, with approximately 97% class dominance in each, highlights the capability of clustering algorithms to support diagnostic classification. Among the various methods evaluated, Partitioning Around Medoids (PAM) emerged as the most effective algorithm for this dataset, likely due to its robustness to outliers and its efficiency on smaller datasets.
Furthermore, dimensionality reduction via Principal Component Analysis (PCA) proved highly successful, reducing the feature space from nine dimensions to two principal components while still explaining approximately 75% of the dataset’s variance. The interpretability of these components was a significant finding; they were primarily driven by morphological features such as Uniform Cell Shape, Uniform Cell Size, and Bland Chromatin, alongside the cell division indicator Mitoses. When clustering was reapplied to this reduced two-dimensional dataset, the results were not only consistent but also yielded higher average silhouette widths, validating that the dimensionality reduction retained essential information and improved cluster cohesion. A comparative analysis with Multidimensional Scaling (MDS) confirmed PCA’s superiority for this dataset, as PCA achieved a more efficient and representative reduction in two dimensions according to Kruskal’s stress benchmarks.
In conclusion, this analysis validates that clustering is a powerful tool for identifying intrinsic data structures that align with clinical classifications. Moreover, the successful application of PCA followed by clustering underscores the value of a combined unsupervised learning workflow, which can simplify complex datasets, enhance model performance, and provide interpretable insights into the most significant features.
Significance of Mitoses:
https://www.ncbi.nlm.nih.gov/books/NBK9553/#:~:text=Mitoses%20are%20characteristic%20of%20malignant,dissymmetrical%20structures%20and%20atypical%20forms
Trends of Cancerous vs Normal Cells:
https://www.verywellhealth.com/cancer-cells-vs-normal-cells-2248794#:~:text=Under%20a%20microscope%2C%20normal%20cells,some%20are%20smaller%20than%20normal
MDS breakdown:
https://towardsdatascience.com/multidimensional-scaling-mds-for-dimensionality-reduction-and-data-visualization-d5252c8bc4c0/