Introduction

Breast cancer is one of the most commonly diagnosed cancers worldwide and remains a major public health concern. Early detection plays a critical role in improving patient outcomes, as timely diagnosis significantly increases the chances of successful treatment. Because of this, it is a well-suited and meaningful domain for testing whether unsupervised learning techniques such as clustering and dimension reduction can recover clinically meaningful structure without using the diagnostic labels.

Clustering Analysis

# Install packages (uncomment if needed)
# install.packages("factoextra")
# install.packages("cluster")
# install.packages("fpc")
# install.packages("clustertend")
# install.packages("ClusterR")
# install.packages("flexclust")
# install.packages("NbClust")
# install.packages("tidyverse")
# install.packages("dendextend")
# install.packages("dbscan")
# install.packages("smacof")
# install.packages("ggfortify")
# install.packages("corrplot")
# install.packages("ape")
# install.packages("psych")

# Load required packages
library(factoextra)
library(psych)
library(corrplot)
library(smacof)
library(ggfortify)
library(cluster)
library(fpc)
library(clustertend)
library(ClusterR)
library(flexclust)
library(NbClust)
library(tidyverse)
library(dendextend)
library(dbscan)
library(ape)


The dataset chosen for this project can be found at: Dataset Link

Each observation in the dataset represents a breast cell sample described using the following features:

  • MALIG: Diagnostic indicator used only for interpretation, where −1 denotes benign tumors and 1 denotes malignant tumors.

  • Clump_Thick: Measures the thickness of cell clumps, with higher values indicating denser and more irregular cell groupings.

  • Uniform_Cell_Size: Assesses the consistency of cell sizes, where higher values reflect greater variability, often associated with malignancy.

  • Uniform_Cell_Shape: Evaluates uniformity in cell shape, with higher scores indicating irregular and abnormal cell morphology.

  • Marginal_Adhes: Represents how strongly cells adhere to one another, with lower adhesion commonly observed in malignant cells.

  • Epit_Size: Measures the size of epithelial cells, where larger values may indicate abnormal cell growth.

  • Bare_Nuclei: Counts the presence of bare nuclei, which are more frequently found in malignant samples.

  • Bland_Chrom: Describes the texture and uniformity of chromatin in the nucleus, with higher values indicating irregular chromatin patterns.

  • Norm_Nucleoli: Measures the prominence of nucleoli, which tend to be more visible and numerous in malignant cells.

  • Mitoses: Indicates the rate of cell division, where higher values suggest increased proliferative activity.

1. Data Loading and Exploration

# Load the dataset
wbc <- read.csv("C:/Users/David Abraham/Desktop/Masters Studies/Unsupervised Learning/USL PROJECT 1/WBCdiag_02_withheader.csv")

# Inspect the dataset
str(wbc)
## 'data.frame':    683 obs. of  10 variables:
##  $ MALIG             : int  -1 -1 -1 -1 -1 1 -1 -1 -1 -1 ...
##  $ Clump_Thick       : int  5 5 3 6 4 8 1 2 2 4 ...
##  $ Uniform_Cell_Size : int  1 4 1 8 1 10 1 1 1 2 ...
##  $ Uniform_Cell_Shape: int  1 4 1 8 1 10 1 2 1 1 ...
##  $ Marginal_Adhes    : int  1 5 1 1 3 8 1 1 1 1 ...
##  $ Epit_Size         : int  2 7 2 3 2 7 2 2 2 2 ...
##  $ Bare_Nuclei       : int  1 10 2 4 1 10 10 1 1 1 ...
##  $ Bland_Chrom       : int  3 3 3 3 3 9 3 3 1 2 ...
##  $ Norm_Nucleoli     : int  1 2 1 7 1 7 1 1 1 1 ...
##  $ Mitoses           : int  1 1 1 1 1 1 1 1 5 1 ...
summary(wbc)
##      MALIG          Clump_Thick     Uniform_Cell_Size Uniform_Cell_Shape
##  Min.   :-1.0000   Min.   : 1.000   Min.   : 1.000    Min.   : 1.000    
##  1st Qu.:-1.0000   1st Qu.: 2.000   1st Qu.: 1.000    1st Qu.: 1.000    
##  Median :-1.0000   Median : 4.000   Median : 1.000    Median : 1.000    
##  Mean   :-0.3001   Mean   : 4.442   Mean   : 3.151    Mean   : 3.215    
##  3rd Qu.: 1.0000   3rd Qu.: 6.000   3rd Qu.: 5.000    3rd Qu.: 5.000    
##  Max.   : 1.0000   Max.   :10.000   Max.   :10.000    Max.   :10.000    
##  Marginal_Adhes    Epit_Size       Bare_Nuclei      Bland_Chrom    
##  Min.   : 1.00   Min.   : 1.000   Min.   : 1.000   Min.   : 1.000  
##  1st Qu.: 1.00   1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 2.000  
##  Median : 1.00   Median : 2.000   Median : 1.000   Median : 3.000  
##  Mean   : 2.83   Mean   : 3.234   Mean   : 3.545   Mean   : 3.445  
##  3rd Qu.: 4.00   3rd Qu.: 4.000   3rd Qu.: 6.000   3rd Qu.: 5.000  
##  Max.   :10.00   Max.   :10.000   Max.   :10.000   Max.   :10.000  
##  Norm_Nucleoli      Mitoses      
##  Min.   : 1.00   Min.   : 1.000  
##  1st Qu.: 1.00   1st Qu.: 1.000  
##  Median : 1.00   Median : 1.000  
##  Mean   : 2.87   Mean   : 1.603  
##  3rd Qu.: 4.00   3rd Qu.: 1.000  
##  Max.   :10.00   Max.   :10.000
# Check for missing values
colSums(is.na(wbc))
##              MALIG        Clump_Thick  Uniform_Cell_Size Uniform_Cell_Shape 
##                  0                  0                  0                  0 
##     Marginal_Adhes          Epit_Size        Bare_Nuclei        Bland_Chrom 
##                  0                  0                  0                  0 
##      Norm_Nucleoli            Mitoses 
##                  0                  0
# Store the true class labels for later validation
true_labels <- wbc$MALIG

# Check class distribution
table(true_labels)
## true_labels
##  -1   1 
## 444 239

Before we begin any clustering, we need to make sure the data is cleaned and standardized, and we must remove the MALIG column. We want the clustering to be driven only by features that do not already contain the answer about whether a sample is malignant; that insight should emerge from the clustering itself, so the label column has to be excluded.

2. Data Preparation

# Extract only the 9 clinical features (remove MALIG column)
wbc_features <- wbc[, 2:10]

# Check the features
head(wbc_features)
summary(wbc_features)
##   Clump_Thick     Uniform_Cell_Size Uniform_Cell_Shape Marginal_Adhes 
##  Min.   : 1.000   Min.   : 1.000    Min.   : 1.000     Min.   : 1.00  
##  1st Qu.: 2.000   1st Qu.: 1.000    1st Qu.: 1.000     1st Qu.: 1.00  
##  Median : 4.000   Median : 1.000    Median : 1.000     Median : 1.00  
##  Mean   : 4.442   Mean   : 3.151    Mean   : 3.215     Mean   : 2.83  
##  3rd Qu.: 6.000   3rd Qu.: 5.000    3rd Qu.: 5.000     3rd Qu.: 4.00  
##  Max.   :10.000   Max.   :10.000    Max.   :10.000     Max.   :10.00  
##    Epit_Size       Bare_Nuclei      Bland_Chrom     Norm_Nucleoli  
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 2.000   1st Qu.: 1.00  
##  Median : 2.000   Median : 1.000   Median : 3.000   Median : 1.00  
##  Mean   : 3.234   Mean   : 3.545   Mean   : 3.445   Mean   : 2.87  
##  3rd Qu.: 4.000   3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 4.00  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.00  
##     Mitoses      
##  Min.   : 1.000  
##  1st Qu.: 1.000  
##  Median : 1.000  
##  Mean   : 1.603  
##  3rd Qu.: 1.000  
##  Max.   :10.000
# Standardize the data (z-score normalization)
wbc_features_z <- as.data.frame(lapply(wbc_features, scale))

# Verify standardization
summary(wbc_features_z)
##   Clump_Thick      Uniform_Cell_Size Uniform_Cell_Shape Marginal_Adhes   
##  Min.   :-1.2203   Min.   :-0.7017   Min.   :-0.7412    Min.   :-0.6389  
##  1st Qu.:-0.8658   1st Qu.:-0.7017   1st Qu.:-0.7412    1st Qu.:-0.6389  
##  Median :-0.1568   Median :-0.7017   Median :-0.7412    Median :-0.6389  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000    Mean   : 0.0000  
##  3rd Qu.: 0.5523   3rd Qu.: 0.6033   3rd Qu.: 0.5972    3rd Qu.: 0.4084  
##  Max.   : 1.9703   Max.   : 2.2345   Max.   : 2.2702    Max.   : 2.5029  
##    Epit_Size        Bare_Nuclei       Bland_Chrom      Norm_Nucleoli    
##  Min.   :-1.0050   Min.   :-0.6983   Min.   :-0.9981   Min.   :-0.6125  
##  1st Qu.:-0.5552   1st Qu.:-0.6983   1st Qu.:-0.5899   1st Qu.:-0.6125  
##  Median :-0.5552   Median :-0.6983   Median :-0.1817   Median :-0.6125  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3444   3rd Qu.: 0.6738   3rd Qu.: 0.6347   3rd Qu.: 0.3703  
##  Max.   : 3.0434   Max.   : 1.7716   Max.   : 2.6758   Max.   : 2.3358  
##     Mitoses       
##  Min.   :-0.3481  
##  1st Qu.:-0.3481  
##  Median :-0.3481  
##  Mean   : 0.0000  
##  3rd Qu.:-0.3481  
##  Max.   : 4.8461

3. Clustering Tendency of Dataset

  1. To examine the tendency of the dataset to form clusters, the Hopkins statistic is a useful metric.
  2. The ordered dissimilarity matrix is another helpful tool for an early impression of the dataset's clusterability.
# Hopkins statistic via clustertend (this implementation reports values close to 0 for clusterable data)
hopkins(wbc_features_z, n = nrow(wbc_features_z) - 1)
## $H
## [1] 0.2733698
# Visual assessment of clustering tendency (factoextra reports Hopkins on the usual scale: values near 1 indicate clusterable data)
get_clust_tendency(wbc_features_z, n = 50, graph = TRUE, 
                   gradient = list(low = "red", mid = "white", high = "blue"))
## $hopkins_stat
## [1] 0.7123601
## 
## $plot

# Ordered Dissimilarity Matrix
dist_matrix <- dist(wbc_features_z, method = "euclidean")
fviz_dist(dist_matrix, show_labels = FALSE) + 
  labs(title = "Breast Cancer Data - Distance Matrix")

4. Attempting clustering with K-means

In order to identify the optimal number of clusters, I started with the elbow method before moving on to the silhouette method, the gap statistic and the Calinski-Harabasz index, each evaluated over a range of 2 to 15 clusters.

# Elbow method for k-means
fviz_nbclust(wbc_features_z, kmeans, method = "wss",k.max = 15) +
  labs(title = "Elbow Method - K-means")

# Silhouette method for k-means
fviz_nbclust(wbc_features_z, kmeans, method = "silhouette",k.max = 15) +
  labs(title = "Silhouette Method - K-means")

# Gap statistic
fviz_nbclust(wbc_features_z, 
             kmeans ,
             k.max =  15,
             method = "gap_stat")

# NbClust - multiple indices (Calinski-Harabasz)
nb <- NbClust(wbc_features_z, distance = "euclidean", 
              min.nc = 2, max.nc = 15, method = "kmeans", 
              index = "ch")
cat("\nNbClust recommendation (CH index):\n")
## 
## NbClust recommendation (CH index):
print(nb$Best.nc)
## Number_clusters     Value_Index 
##          2.0000        853.4126
# Optimal clusters using ClusterR package
opt_km <- Optimal_Clusters_KMeans(wbc_features_z, max_clusters = 15, 
                                   plot_clusters = TRUE, criterion = "silhouette")

The above results suggest that k = 2 is the value worth trying and visualizing with the k-means clustering algorithm.

# ============================================
# K-MEANS CLUSTERING
# ============================================

# Perform k-means with 2 clusters (expecting benign/malignant)
set.seed(123)
km_result <- kmeans(wbc_features_z, centers = 2, nstart = 25)

# Cluster sizes
km_result$size
## [1] 453 230
# Cluster centers
km_result$centers
##   Clump_Thick Uniform_Cell_Size Uniform_Cell_Shape Marginal_Adhes Epit_Size
## 1  -0.4971820        -0.6087925         -0.6045807     -0.5179091 -0.512503
## 2   0.9792324         1.1990564          1.1907612      1.0200557  1.009408
##   Bare_Nuclei Bland_Chrom Norm_Nucleoli    Mitoses
## 1  -0.5886887  -0.5484556    -0.5307637 -0.3073751
## 2   1.1594607   1.0802192     1.0453737  0.6053953
# Add cluster assignment to original data
wbc$cluster_km <- km_result$cluster

# Visualize clusters
fviz_cluster(km_result, data = wbc_features_z, 
             ellipse.type = "norm", geom = "point", 
             stand = FALSE, palette = "jco", ggtheme = theme_classic()) +
  labs(title = "K-means Clustering (k=2)")

# Compare with true labels
table(Predicted = km_result$cluster, Actual = true_labels)
##          Actual
## Predicted  -1   1
##         1 434  19
##         2  10 220
# Silhouette plot
sil_km <- silhouette(km_result$cluster, dist(wbc_features_z))
fviz_silhouette(sil_km) + labs(title = "Silhouette Plot - K-means")
##   cluster size ave.sil.width
## 1       1  453          0.75
## 2       2  230          0.23

# Calinski-Harabasz index
round(calinhara(wbc_features_z, km_result$cluster), digits = 2)
## [1] 853.41

The above plots show that k = 2 produces two clusters without much separation, and one of the clusters is clearly far more compact than the other, which aligns with the silhouette width scores. Let us try PAM next.

5. Attempting clustering with PAM

# Determine optimal k using silhouette method
fviz_nbclust(wbc_features_z, pam, method = "silhouette", k.max = 15) +
  labs(title = "Optimal Number of Clusters - PAM (Silhouette Method)")

# Determine optimal k using elbow method
fviz_nbclust(wbc_features_z, pam, method = "wss", k.max = 15) +
  labs(title = "Optimal Number of Clusters - PAM (Elbow Method)")

# Manual CH calculation for PAM
cat("\n=== Calinski-Harabasz Index (PAM) ===\n")
## 
## === Calinski-Harabasz Index (PAM) ===
ch_scores_pam <- numeric(14)  # one value per k in 2:15
for (k in 2:15) {
  set.seed(123)
  pam_temp <- pam(wbc_features_z, k = k)
  ch_scores_pam[k-1] <- calinhara(wbc_features_z, pam_temp$clustering)
}


# Plot
plot(2:15, ch_scores_pam, type = "b", 
     xlab = "Number of Clusters (k)", 
     ylab = "Calinski-Harabasz Index",
     main = "PAM - Calinski-Harabasz Index by K",
     col = "purple", pch = 19, lwd = 2)
grid()

# Perform PAM with k=2
set.seed(123)
pam_k2 <- pam(wbc_features_z, k = 2)

cat("\n=== PAM with k=2 ===\n")
## 
## === PAM with k=2 ===
cat("Average Silhouette Width:", pam_k2$silinfo$avg.width, "\n")
## Average Silhouette Width: 0.571736
cat("Cluster sizes:", pam_k2$clusinfo[, 1], "\n")
## Cluster sizes: 450 233
# Visualize k=2
fviz_cluster(pam_k2, geom = "point", ellipse.type = "norm") +
  labs(title = "PAM Clustering (k = 2)")

  labs(title = "Silhouette Plot - PAM (k = 2)")
## <ggplot2::labels> List of 1
##  $ title: chr "Silhouette Plot - PAM (k = 2)"
fviz_silhouette(pam_k2) + 
  labs(title = "Silhouette Plot - PAM (k = 2)")
##   cluster size ave.sil.width
## 1       1  450          0.75
## 2       2  233          0.22

# Perform PAM with k=3
set.seed(123)
pam_k3 <- pam(wbc_features_z, k = 3)

cat("\n=== PAM with k=3 ===\n")
## 
## === PAM with k=3 ===
cat("Average Silhouette Width:", pam_k3$silinfo$avg.width, "\n")
## Average Silhouette Width: 0.5501258
cat("Cluster sizes:", pam_k3$clusinfo[, 1], "\n")
## Cluster sizes: 450 198 35
# Visualize k=3
fviz_cluster(pam_k3, geom = "point", ellipse.type = "norm") +
  labs(title = "PAM Clustering (k = 3)")

fviz_silhouette(pam_k3) + 
  labs(title = "Silhouette Plot - PAM (k = 3)")
##   cluster size ave.sil.width
## 1       1  450          0.73
## 2       2  198          0.18
## 3       3   35          0.26

PAM shows somewhat better results for k = 2: not only is it a more robust method because it is based on medoids, but we also see slightly more separation between the clusters.
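
Because PAM represents each cluster by an actual observation (its medoid), those representative samples can be inspected on the original 1-10 scale. The snippet below is a small added sketch that reuses pam_k2$id.med, the medoid row indices returned by pam().

# Medoid observations on the original measurement scale
wbc_features[pam_k2$id.med, ]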

6. Attempting clustering with Hierarchical Clustering

# Optimal number of clusters for hierarchical
fviz_nbclust(wbc_features_z, FUN = hcut, method = "wss") +
  labs(title = "Elbow Method - Hierarchical")

fviz_nbclust(wbc_features_z, FUN = hcut, method = "silhouette") +
  labs(title = "Silhouette Method - Hierarchical")

# Compute distance matrix
dist_matrix <- dist(wbc_features_z, method = "euclidean")

# Agglomerative - Complete linkage
hc_complete <- hclust(dist_matrix, method = "complete")
plot(hc_complete, cex = 0.6, hang = -1, 
     main = "Dendrogram - Complete Linkage")

# Agglomerative - Ward's method
hc_ward <- hclust(dist_matrix, method = "ward.D2")
plot(hc_ward, cex = 0.6, hang = -1, 
     main = "Dendrogram - Ward's Method")

# Cut tree into 2 groups
hc_clusters <- cutree(hc_ward, k = 2)

# Number of members in each cluster
table(hc_clusters)
## hc_clusters
##   1   2 
## 440 243
# Visualize cut
plot(hc_ward, cex = 0.6)
rect.hclust(hc_ward, k = 2, border = 2:3)

hc_clusters3 <- cutree(hc_ward, k = 3)
table(hc_clusters3)
## hc_clusters3
##   1   2   3 
## 440 207  36
# Cluster plot
fviz_cluster(list(data = wbc_features_z, cluster = hc_clusters)) +
  labs(title = "Hierarchical Clustering - Ward's Method k=2")

fviz_cluster(list(data = wbc_features_z, cluster = hc_clusters3)) +
  labs(title = "Hierarchical Clustering - Ward's Method k=3")

# Compare with true labels
table(Predicted = hc_clusters, Actual = true_labels)
##          Actual
## Predicted  -1   1
##         1 431   9
##         2  13 230
# Compare different linkage methods
methods <- c("average", "single", "complete", "ward")
ac_values <- sapply(methods, function(m) {
  agnes(wbc_features_z, method = m)$ac
})
cat("\nAgglomerative Coefficients:\n")
## 
## Agglomerative Coefficients:
print(ac_values)
##   average    single  complete      ward 
## 0.8896490 0.7915542 0.9241651 0.9906957
# AGNES (Agglomerative Nesting) - use "ward" not "ward.D2"
hc_agnes <- agnes(wbc_features_z, method = "ward")
pltree(hc_agnes, cex = 0.6, hang = -1, 
       main = "Dendrogram - AGNES")

cat("Agglomerative coefficient:", hc_agnes$ac, "\n")
## Agglomerative coefficient: 0.9906957
# DIANA (Divisive Analysis)
hc_diana <- diana(wbc_features_z)
pltree(hc_diana, cex = 0.6, hang = -1, 
       main = "Dendrogram - DIANA")

cat("Divisive coefficient:", hc_diana$dc, "\n")
## Divisive coefficient: 0.9171601

Hierarchical clustering yielded inferior results compared to PAM, with lower average silhouette widths indicating weaker cluster cohesion. The dendrograms across multiple linkage methods (Ward’s, Complete, Average) exhibited higher inter-cluster overlap, suggesting that hierarchical agglomeration struggles with the subtle boundaries between benign and malignant cases in this dataset.
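
The cohesion claim can be checked directly; the sketch below (not part of the original output) computes the silhouette for the Ward k = 2 cut, reusing the hc_clusters and dist_matrix objects defined above.

# Silhouette for the hierarchical (Ward, k = 2) solution
sil_hc <- silhouette(hc_clusters, dist_matrix)
fviz_silhouette(sil_hc) +
  labs(title = "Silhouette Plot - Hierarchical (Ward, k = 2)")

# Average silhouette width, directly comparable to the k-means and PAM values above
mean(sil_hc[, "sil_width"])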

7. Attempting clustering with DBSCAN

# Determine optimal eps using k-NN distance plot
dbscan::kNNdistplot(wbc_features_z, k = 5)
abline(h = 1, lty = 2, col = "red")  # Adjust this value based on plot

# Perform DBSCAN
set.seed(123)
db_result <- fpc::dbscan(wbc_features_z, eps = 1.5, MinPts = 8)

# Print results
print(db_result)
## dbscan Pts=683 MinPts=8 eps=1.5
##          0   1 2 3
## border 230  12 7 7
## seed     0 425 1 1
## total  230 437 8 8
# Plot DBSCAN results
plot(db_result, wbc_features_z, main = "DBSCAN", frame = FALSE)

# Visualize with factoextra
fviz_cluster(db_result, wbc_features_z, stand = FALSE, 
             frame = FALSE, geom = "point") +
  labs(title = "DBSCAN Clustering")

# Check cluster membership (0 = noise/outlier)
table(db_result$cluster)
## 
##   0   1   2   3 
## 230 437   8   8

Even though DBSCAN yields the most clearly separated clusters, it also labels a large number of points as noise, and the number of clusters produced was very sensitive to changes in the MinPts and eps values, often collapsing into one giant cluster.
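
To illustrate this sensitivity, the sketch below (an added illustration, not run in the original analysis) sweeps a few eps values while holding MinPts = 8 and reports how the cluster count and the number of noise points change.

# Sensitivity of DBSCAN to the eps parameter (MinPts fixed at 8)
for (eps_val in c(1.0, 1.25, 1.5, 1.75, 2.0)) {
  db_tmp <- fpc::dbscan(wbc_features_z, eps = eps_val, MinPts = 8)
  n_clusters <- length(setdiff(unique(db_tmp$cluster), 0))  # cluster 0 = noise
  n_noise <- sum(db_tmp$cluster == 0)
  cat("eps =", eps_val, "-> clusters:", n_clusters, "| noise points:", n_noise, "\n")
}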

8. Interpretations and Findings of Clusters

# Create a comprehensive summary for ALL points
cluster_summary <- data.frame(
  Cluster = pam_k2$clustering,
  MALIG = true_labels
)

# Count breakdown for each cluster
cat("=== Complete Breakdown - All Points ===\n\n")
## === Complete Breakdown - All Points ===
for (i in 1:2) {
  cluster_points <- cluster_summary[cluster_summary$Cluster == i, ]
  benign <- sum(cluster_points$MALIG == -1)
  malignant <- sum(cluster_points$MALIG == 1)
  total <- nrow(cluster_points)
  
  cat("Cluster", i, ":\n")
  cat("  Total points:", total, "\n")
  cat("  Benign (-1):", benign, "\n")
  cat("  Malignant (+1):", malignant, "\n")
  cat("  % Benign:", round(benign/total*100, 2), "%\n")
  cat("  % Malignant:", round(malignant/total*100, 2), "%\n\n")
}
## Cluster 1 :
##   Total points: 450 
##   Benign (-1): 435 
##   Malignant (+1): 15 
##   % Benign: 96.67 %
##   % Malignant: 3.33 %
## 
## Cluster 2 :
##   Total points: 233 
##   Benign (-1): 9 
##   Malignant (+1): 224 
##   % Benign: 3.86 %
##   % Malignant: 96.14 %
# ============================================
# FEATURE VALUE RANGES BY CLUSTER
# ============================================

cat("\n=== Feature Value Ranges by Cluster ===\n\n")
## 
## === Feature Value Ranges by Cluster ===
# Add cluster assignment to original data
wbc_with_cluster <- wbc
wbc_with_cluster$cluster <- pam_k2$clustering

for (i in 1:2) {
  cat("CLUSTER", i, ":\n")
  cluster_data <- wbc_with_cluster[wbc_with_cluster$cluster == i, 2:10]
  
  # Calculate statistics for each feature
  cat("Feature               Min    Q1   Median   Q3    Max   Mean\n")
  cat("---------------------------------------------------------------\n")
  
  for (feat in colnames(cluster_data)) {
    stats <- summary(cluster_data[[feat]])
    cat(sprintf("%-20s %4.1f  %4.1f  %4.1f  %4.1f  %4.1f  %4.1f\n",
                feat,
                stats[1], stats[2], stats[3], stats[5], stats[6], stats[4]))
  }
  cat("\n")
}
## CLUSTER 1 :
## Feature               Min    Q1   Median   Q3    Max   Mean
## ---------------------------------------------------------------
## Clump_Thick           1.0   1.0   3.0   4.0  10.0   3.0
## Uniform_Cell_Size     1.0   1.0   1.0   1.0   5.0   1.3
## Uniform_Cell_Shape    1.0   1.0   1.0   1.0   8.0   1.4
## Marginal_Adhes        1.0   1.0   1.0   1.0  10.0   1.3
## Epit_Size             1.0   2.0   2.0   2.0  10.0   2.1
## Bare_Nuclei           1.0   1.0   1.0   1.0  10.0   1.3
## Bland_Chrom           1.0   1.0   2.0   3.0   7.0   2.1
## Norm_Nucleoli         1.0   1.0   1.0   1.0   9.0   1.2
## Mitoses               1.0   1.0   1.0   1.0   8.0   1.1
## 
## CLUSTER 2 :
## Feature               Min    Q1   Median   Q3    Max   Mean
## ---------------------------------------------------------------
## Clump_Thick           1.0   5.0   8.0  10.0  10.0   7.2
## Uniform_Cell_Size     1.0   4.0   7.0  10.0  10.0   6.8
## Uniform_Cell_Shape    1.0   5.0   7.0   9.0  10.0   6.7
## Marginal_Adhes        1.0   3.0   5.0   8.0  10.0   5.7
## Epit_Size             2.0   4.0   5.0   7.0  10.0   5.5
## Bare_Nuclei           1.0   5.0  10.0  10.0  10.0   7.9
## Bland_Chrom           1.0   4.0   7.0   8.0  10.0   6.1
## Norm_Nucleoli         1.0   3.0   6.0  10.0  10.0   6.0
## Mitoses               1.0   1.0   1.0   3.0  10.0   2.6
# Compare means between clusters
cat("\n=== Mean Feature Values by Cluster ===\n")
## 
## === Mean Feature Values by Cluster ===
cluster_means <- aggregate(wbc[, 2:10], 
                           by = list(Cluster = pam_k2$clustering), 
                           FUN = mean)
print(round(cluster_means, 2))
##   Cluster Clump_Thick Uniform_Cell_Size Uniform_Cell_Shape Marginal_Adhes
## 1       1        3.03              1.28               1.41           1.34
## 2       2        7.18              6.76               6.70           5.70
##   Epit_Size Bare_Nuclei Bland_Chrom Norm_Nucleoli Mitoses
## 1      2.08        1.30        2.08          1.24    1.09
## 2      5.45        7.87        6.08          6.01    2.59
# Difference in means (Cluster 1 - Cluster 2)
cat("\n=== Difference in Means (Cluster 1 - Cluster 2) ===\n")
## 
## === Difference in Means (Cluster 1 - Cluster 2) ===
mean_diff <- cluster_means[1, -1] - cluster_means[2, -1]
mean_diff_sorted <- sort(abs(as.numeric(mean_diff)), decreasing = TRUE, index.return = TRUE)
cat("\nFeatures ranked by difference:\n")
## 
## Features ranked by difference:
for (i in 1:9) {
  feat_idx <- mean_diff_sorted$ix[i]
  feat_name <- names(mean_diff)[feat_idx]
  diff_val <- mean_diff[feat_idx]
  cat(sprintf("%d. %-25s Difference: %6.2f\n", i, feat_name, diff_val))
}
## 1. Bare_Nuclei               Difference:  -6.57
## 2. Uniform_Cell_Size         Difference:  -5.47
## 3. Uniform_Cell_Shape        Difference:  -5.28
## 4. Norm_Nucleoli             Difference:  -4.77
## 5. Marginal_Adhes            Difference:  -4.36
## 6. Clump_Thick               Difference:  -4.15
## 7. Bland_Chrom               Difference:  -4.00
## 8. Epit_Size                 Difference:  -3.37
## 9. Mitoses                   Difference:  -1.49
# ============================================
# TYPICAL VALUE RANGES FOR CLASSIFICATION
# ============================================

cat("\n=== Typical Ranges for Benign vs Malignant ===\n\n")
## 
## === Typical Ranges for Benign vs Malignant ===
# Identify which cluster is benign/malignant
cluster1_malig_pct <- sum(wbc_with_cluster$cluster == 1 & wbc_with_cluster$MALIG == 1) / 
                      sum(wbc_with_cluster$cluster == 1)

if (cluster1_malig_pct > 0.5) {
  malig_cluster <- 1
  benign_cluster <- 2
} else {
  malig_cluster <- 2
  benign_cluster <- 1
}

cat("Benign Cluster:", benign_cluster, "\n")
## Benign Cluster: 1
cat("Malignant Cluster:", malig_cluster, "\n\n")
## Malignant Cluster: 2
# Show typical ranges
benign_data <- wbc_with_cluster[wbc_with_cluster$cluster == benign_cluster, 2:10]
malig_data <- wbc_with_cluster[wbc_with_cluster$cluster == malig_cluster, 2:10]

cat("BENIGN TUMORS - Typical Ranges (25th-75th percentile):\n")
## BENIGN TUMORS - Typical Ranges (25th-75th percentile):
cat("Feature               Q1-Q3 Range       Median\n")
## Feature               Q1-Q3 Range       Median
cat("----------------------------------------------------\n")
## ----------------------------------------------------
for (feat in colnames(benign_data)) {
  q1 <- quantile(benign_data[[feat]], 0.25)
  q3 <- quantile(benign_data[[feat]], 0.75)
  med <- median(benign_data[[feat]])
  cat(sprintf("%-20s %4.1f - %4.1f        %4.1f\n", feat, q1, q3, med))
}
## Clump_Thick           1.0 -  4.0         3.0
## Uniform_Cell_Size     1.0 -  1.0         1.0
## Uniform_Cell_Shape    1.0 -  1.0         1.0
## Marginal_Adhes        1.0 -  1.0         1.0
## Epit_Size             2.0 -  2.0         2.0
## Bare_Nuclei           1.0 -  1.0         1.0
## Bland_Chrom           1.0 -  3.0         2.0
## Norm_Nucleoli         1.0 -  1.0         1.0
## Mitoses               1.0 -  1.0         1.0
cat("\nMALIGNANT TUMORS - Typical Ranges (25th-75th percentile):\n")
## 
## MALIGNANT TUMORS - Typical Ranges (25th-75th percentile):
cat("Feature               Q1-Q3 Range       Median\n")
## Feature               Q1-Q3 Range       Median
cat("----------------------------------------------------\n")
## ----------------------------------------------------
for (feat in colnames(malig_data)) {
  q1 <- quantile(malig_data[[feat]], 0.25)
  q3 <- quantile(malig_data[[feat]], 0.75)
  med <- median(malig_data[[feat]])
  cat(sprintf("%-20s %4.1f - %4.1f        %4.1f\n", feat, q1, q3, med))
}
## Clump_Thick           5.0 - 10.0         8.0
## Uniform_Cell_Size     4.0 - 10.0         7.0
## Uniform_Cell_Shape    5.0 -  9.0         7.0
## Marginal_Adhes        3.0 -  8.0         5.0
## Epit_Size             4.0 -  7.0         5.0
## Bare_Nuclei           5.0 - 10.0        10.0
## Bland_Chrom           4.0 -  8.0         7.0
## Norm_Nucleoli         3.0 - 10.0         6.0
## Mitoses               1.0 -  3.0         1.0
cat("=== Confusion Matrix ===\n")
## === Confusion Matrix ===
confusion_matrix <- table(Predicted = pam_k2$clustering, Actual = true_labels)
print(confusion_matrix)
##          Actual
## Predicted  -1   1
##         1 435  15
##         2   9 224
# ============================================
# VISUALIZATIONS
# ============================================

# Percentage stacked bar plot
cluster_pct <- prop.table(table(cluster_summary$Cluster, cluster_summary$MALIG), margin = 1) * 100
barplot(t(cluster_pct), 
        col = c("lightblue", "lightcoral"),
        main = "Cluster Composition: Percentage Distribution",
        xlab = "Cluster",
        ylab = "Percentage (%)",
        legend = c("Benign (-1)", "Malignant (+1)"),
        args.legend = list(x = "topright", title = "MALIG Status"),
        ylim = c(0, 100))

# Add percentage labels
for (i in 1:2) {
  text(x = (i-1)*1.2 + 0.7, y = cluster_pct[i,1]/2, 
       labels = paste0(round(cluster_pct[i,1], 1), "%"), 
       col = "darkblue", font = 2, cex = 1.1)
  text(x = (i-1)*1.2 + 0.7, y = cluster_pct[i,1] + cluster_pct[i,2]/2, 
       labels = paste0(round(cluster_pct[i,2], 1), "%"), 
       col = "darkred", font = 2, cex = 1.1)
}

The above graph lets us draw some meaningful insights. The data can be grouped into two clusters, which makes sense because the dataset contains features intended to aid breast cancer detection. Taking a deeper look at the MALIG values inside both clusters, it is very evident that cluster 1 is predominantly benign with a few exceptions, while cluster 2 is predominantly malignant. This is a striking finding: without any labels, the remaining nine features were informative enough to make this grouping highly accurate.

The Feature Value Ranges tables highlight the key differences between the two clusters which is aligned with cytological findings.

  • Uniformity of Cell Size and Shape: Cluster 1 shows median values of 1.0, indicating well-formed, uniform cells, whereas Cluster 2 has higher medians of 7.0 for size and 7.0 for shape, reflecting disorganized and irregular cell structures typical of malignancy.

  • Bare Nuclei: There is a striking difference in Bare Nuclei, with Cluster 1 showing a low mean of 1.3 while Cluster 2 has a mean of 7.9, indicating frequent nuclei stripped of surrounding cytoplasm, a pattern far more common in malignant samples.

  • Mitoses: Mitoses are almost absent in Cluster 1 with a mean of 1.1, while Cluster 2 shows a significantly higher mean of 2.6, indicating more active cell division, a key indicator of malignancy.

Lastly, after trying several different clustering approaches, PAM showed the best results, with the highest silhouette scores for k = 2 along with visibly greater inter-cluster separation.

Dimension Reduction Analysis

# ============================================
# DIMENSION REDUCTION - MDS & PCA
# ============================================

# Prepare data
wbc_features_z <- as.data.frame(lapply(wbc[, 2:10], scale))

1. Correlation Analysis

# ============================================
# CORRELATION ANALYSIS
# ============================================

# Check if variables are correlated (necessary for dimension reduction)
wbc_cor <- cor(wbc[, 2:10], method = "pearson")
print(round(wbc_cor, 2))
##                    Clump_Thick Uniform_Cell_Size Uniform_Cell_Shape
## Clump_Thick               1.00              0.64               0.65
## Uniform_Cell_Size         0.64              1.00               0.91
## Uniform_Cell_Shape        0.65              0.91               1.00
## Marginal_Adhes            0.49              0.71               0.69
## Epit_Size                 0.52              0.75               0.72
## Bare_Nuclei               0.59              0.69               0.71
## Bland_Chrom               0.55              0.76               0.74
## Norm_Nucleoli             0.53              0.72               0.72
## Mitoses                   0.35              0.46               0.44
##                    Marginal_Adhes Epit_Size Bare_Nuclei Bland_Chrom
## Clump_Thick                  0.49      0.52        0.59        0.55
## Uniform_Cell_Size            0.71      0.75        0.69        0.76
## Uniform_Cell_Shape           0.69      0.72        0.71        0.74
## Marginal_Adhes               1.00      0.59        0.67        0.67
## Epit_Size                    0.59      1.00        0.59        0.62
## Bare_Nuclei                  0.67      0.59        1.00        0.68
## Bland_Chrom                  0.67      0.62        0.68        1.00
## Norm_Nucleoli                0.60      0.63        0.58        0.67
## Mitoses                      0.42      0.48        0.34        0.35
##                    Norm_Nucleoli Mitoses
## Clump_Thick                 0.53    0.35
## Uniform_Cell_Size           0.72    0.46
## Uniform_Cell_Shape          0.72    0.44
## Marginal_Adhes              0.60    0.42
## Epit_Size                   0.63    0.48
## Bare_Nuclei                 0.58    0.34
## Bland_Chrom                 0.67    0.35
## Norm_Nucleoli               1.00    0.43
## Mitoses                     0.43    1.00
# Visualize correlation matrix
corrplot(wbc_cor)

The correlation matrix shows strong positive relationships among most features in the dataset, particularly between Uniform Cell Size, Uniform Cell Shape, Bare Nuclei, Bland Chromatin, and Normal Nucleoli, indicating that these variables capture closely related aspects of cellular abnormality. Clump Thickness and Epithelial Cell Size also exhibit moderate to strong correlations.

In contrast, Mitoses shows comparatively weaker correlations with other variables, suggesting that it represents a more independent source of variation. But overall this is a good sign that dimension reduction will be possible with the dataset!
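
As an additional, optional check that the correlation structure supports dimension reduction, the Kaiser-Meyer-Olkin measure and Bartlett's test of sphericity from the psych package (already loaded) can be applied to the correlation matrix; this is an added sketch, not part of the original workflow.

# KMO measure of sampling adequacy (values above ~0.6 are generally considered adequate)
KMO(wbc_cor)

# Bartlett's test of sphericity: H0 is that the correlation matrix is an identity matrix
cortest.bartlett(wbc_cor, n = nrow(wbc))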

2. Attempting Dimension reduction with PCA

# Standard PCA
pca_result <- prcomp(wbc_features_z, center = FALSE, scale. = FALSE)
summary(pca_result)
## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.4289 0.88088 0.73434 0.67796 0.61667 0.54943 0.54259
## Proportion of Variance 0.6555 0.08622 0.05992 0.05107 0.04225 0.03354 0.03271
## Cumulative Proportion  0.6555 0.74172 0.80163 0.85270 0.89496 0.92850 0.96121
##                            PC8     PC9
## Standard deviation     0.51062 0.29729
## Proportion of Variance 0.02897 0.00982
## Cumulative Proportion  0.99018 1.00000
# Scree plot - variance explained
fviz_eig(pca_result, addlabels = TRUE, 
         main = "Scree Plot - Variance Explained by Each PC")

# Check eigenvalues
eig_val <- get_eigenvalue(pca_result)
print(eig_val)
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1 5.89949935       65.5499928                    65.54999
## Dim.2 0.77594689        8.6216321                    74.17162
## Dim.3 0.53925224        5.9916916                    80.16332
## Dim.4 0.45962745        5.1069717                    85.27029
## Dim.5 0.38027583        4.2252870                    89.49558
## Dim.6 0.30187645        3.3541828                    92.84976
## Dim.7 0.29440271        3.2711413                    96.12090
## Dim.8 0.26073586        2.8970651                    99.01796
## Dim.9 0.08838322        0.9820358                   100.00000
# How many PCs explain 80% of variance?
cumsum(eig_val$variance.percent)
## [1]  65.54999  74.17162  80.16332  85.27029  89.49558  92.84976  96.12090
## [8]  99.01796 100.00000
# Biplot - variables and observations
fviz_pca_biplot(pca_result, 
                geom.ind = "point",
                col.ind = ifelse(true_labels == -1, "blue", "red"),
                col.var = "black",
                repel = TRUE,
                title = "PCA Biplot - Breast Cancer Data")

# Variable contributions to PC1 and PC2
fviz_contrib(pca_result, choice = "var", axes = 1, top = 9,
             title = "Contribution to PC1")

fviz_contrib(pca_result, choice = "var", axes = 2, top = 9,
             title = "Contribution to PC2")

# Show loadings for first 2 PCs
cat("\nLoadings for PC1 and PC2:\n")
## 
## Loadings for PC1 and PC2:
print(pca_result$rotation[, 1:2])
##                           PC1         PC2
## Clump_Thick        -0.3020626 -0.14080053
## Uniform_Cell_Size  -0.3807930 -0.04664031
## Uniform_Cell_Shape -0.3775825 -0.08242247
## Marginal_Adhes     -0.3327236 -0.05209438
## Epit_Size          -0.3362340  0.16440439
## Bare_Nuclei        -0.3350675 -0.26126062
## Bland_Chrom        -0.3457474 -0.22807676
## Norm_Nucleoli      -0.3355914  0.03396582
## Mitoses            -0.2302064  0.90555729

From the variable plot we can see that Mitoses is the only feature that is quite isolated, which is consistent with the correlation matrix.

According to the Kaiser rule we would retain only the first principal component, since it is the only one with an eigenvalue > 1. However, I want to choose the smallest number of components that explain roughly 75% of the cumulative variance. Since PC1 explains only 65.5% on its own, PC2 needs to be included as well.
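
This choice can also be made programmatically from the eig_val table computed above; the lines below are an added illustration. Note that PC1 + PC2 reach about 74.2%, so a strict 75% cutoff would pull in PC3 (80.2%), while a ~74% cutoff keeps two components.

# Smallest number of components reaching a chosen cumulative-variance cutoff
cum_var <- cumsum(eig_val$variance.percent)
min(which(cum_var >= 74))   # 2 components (74.2%)
min(which(cum_var >= 75))   # 3 components (80.2%)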

Exploring rotated PCA for easier interpretation.

# PCA with varimax rotation for easier interpretation
library(psych)
pca_rotated <- principal(wbc_features_z, nfactors = 2, rotate = "varimax")
print(pca_rotated)
## Principal Components Analysis
## Call: principal(r = wbc_features_z, nfactors = 2, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                     RC1  RC2   h2    u2 com
## Clump_Thick        0.72 0.18 0.55 0.446 1.1
## Uniform_Cell_Size  0.87 0.33 0.86 0.143 1.3
## Uniform_Cell_Shape 0.87 0.30 0.85 0.154 1.2
## Marginal_Adhes     0.76 0.28 0.66 0.345 1.3
## Epit_Size          0.69 0.46 0.69 0.312 1.7
## Bare_Nuclei        0.84 0.11 0.72 0.285 1.0
## Bland_Chrom        0.85 0.15 0.75 0.254 1.1
## Norm_Nucleoli      0.74 0.35 0.67 0.335 1.4
## Mitoses            0.20 0.95 0.95 0.051 1.1
## 
##                        RC1  RC2
## SS loadings           5.09 1.58
## Proportion Var        0.57 0.18
## Cumulative Var        0.57 0.74
## Proportion Explained  0.76 0.24
## Cumulative Proportion 0.76 1.00
## 
## Mean item complexity =  1.2
## Test of the hypothesis that 2 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.05 
##  with the empirical chi square  123.18  with prob <  2.8e-17 
## 
## Fit based upon off diagonal values = 0.99
# Print significant loadings only
cat("\nSignificant Loadings (cutoff = 0.4):\n")
## 
## Significant Loadings (cutoff = 0.4):
print(loadings(pca_rotated), digits = 3, cutoff = 0.4, sort = TRUE)
## 
## Loadings:
##                    RC1   RC2  
## Clump_Thick        0.723      
## Uniform_Cell_Size  0.865      
## Uniform_Cell_Shape 0.871      
## Marginal_Adhes     0.760      
## Epit_Size          0.692 0.457
## Bare_Nuclei        0.838      
## Bland_Chrom        0.851      
## Norm_Nucleoli      0.736      
## Mitoses                  0.954
## 
##                  RC1   RC2
## SS loadings    5.095 1.581
## Proportion Var 0.566 0.176
## Cumulative Var 0.566 0.742
# Quality measures
cat("\nComplexity (lower is better):\n")
## 
## Complexity (lower is better):
print(pca_rotated$complexity)
##        Clump_Thick  Uniform_Cell_Size Uniform_Cell_Shape     Marginal_Adhes 
##           1.119448           1.282949           1.229359           1.263172 
##          Epit_Size        Bare_Nuclei        Bland_Chrom      Norm_Nucleoli 
##           1.731604           1.035236           1.060832           1.431011 
##            Mitoses 
##           1.085271
cat("\nUniqueness (lower is better):\n")
## 
## Uniqueness (lower is better):
print(pca_rotated$uniqueness)
##        Clump_Thick  Uniform_Cell_Size Uniform_Cell_Shape     Marginal_Adhes 
##         0.44633609         0.14286526         0.15364542         0.34479029 
##          Epit_Size        Bare_Nuclei        Bland_Chrom      Norm_Nucleoli 
##         0.31206905         0.28469792         0.25440254         0.33469392 
##            Mitoses 
##         0.05105328
# Plot complexity vs uniqueness
plot(pca_rotated$complexity, pca_rotated$uniqueness,
     main = "Complexity vs Uniqueness",
     xlab = "Complexity", ylab = "Uniqueness",
     pch = 19, col = "steelblue")
text(pca_rotated$complexity, pca_rotated$uniqueness,
     labels = names(pca_rotated$complexity), pos = 3, cex = 0.7)
abline(h = 0.5, v = 1.5, lty = 2, col = "red")

With the two rotated components, the loadings reveal a clear interpretative structure. RC1 captures general cellular abnormality, with high loadings across morphological features (Uniform_Cell_Shape 0.871, Uniform_Cell_Size 0.865, Bland_Chrom 0.851). RC2, on the other hand, captures cell division activity, as it isolates mitotic activity (Mitoses 0.954).
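
The same structure can be visualized directly from the rotated solution; the one-liner below is an added sketch using psych::fa.diagram, which places each variable next to the component on which it loads most strongly.

# Diagram of the varimax-rotated loadings (cutoff 0.4, matching the table above)
fa.diagram(pca_rotated, cut = 0.4, main = "Rotated Component Loadings")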

3. Attempting Dimension Reduction with MDS

print(round(wbc_cor, 2))
##                    Clump_Thick Uniform_Cell_Size Uniform_Cell_Shape
## Clump_Thick               1.00              0.64               0.65
## Uniform_Cell_Size         0.64              1.00               0.91
## Uniform_Cell_Shape        0.65              0.91               1.00
## Marginal_Adhes            0.49              0.71               0.69
## Epit_Size                 0.52              0.75               0.72
## Bare_Nuclei               0.59              0.69               0.71
## Bland_Chrom               0.55              0.76               0.74
## Norm_Nucleoli             0.53              0.72               0.72
## Mitoses                   0.35              0.46               0.44
##                    Marginal_Adhes Epit_Size Bare_Nuclei Bland_Chrom
## Clump_Thick                  0.49      0.52        0.59        0.55
## Uniform_Cell_Size            0.71      0.75        0.69        0.76
## Uniform_Cell_Shape           0.69      0.72        0.71        0.74
## Marginal_Adhes               1.00      0.59        0.67        0.67
## Epit_Size                    0.59      1.00        0.59        0.62
## Bare_Nuclei                  0.67      0.59        1.00        0.68
## Bland_Chrom                  0.67      0.62        0.68        1.00
## Norm_Nucleoli                0.60      0.63        0.58        0.67
## Mitoses                      0.42      0.48        0.34        0.35
##                    Norm_Nucleoli Mitoses
## Clump_Thick                 0.53    0.35
## Uniform_Cell_Size           0.72    0.46
## Uniform_Cell_Shape          0.72    0.44
## Marginal_Adhes              0.60    0.42
## Epit_Size                   0.63    0.48
## Bare_Nuclei                 0.58    0.34
## Bland_Chrom                 0.67    0.35
## Norm_Nucleoli               1.00    0.43
## Mitoses                     0.43    1.00
# Convert correlation matrix to distance matrix (dissimilarity)
library(smacof)
library(ape)
dis2 <- sim2diss(wbc_cor, method = 1, to.dist = TRUE)
# Check the distance matrix
print(dis2)
##                    Clump_Thick Uniform_Cell_Size Uniform_Cell_Shape
## Uniform_Cell_Size   0.35751851                                     
## Uniform_Cell_Shape  0.34653001        0.09277177                   
## Marginal_Adhes      0.51217128        0.29302305         0.31405194
## Epit_Size           0.47640396        0.24645598         0.27753759
## Bare_Nuclei         0.40690856        0.30829125         0.28612245
## Bland_Chrom         0.44625755        0.24444084         0.26465650
## Norm_Nucleoli       0.46593409        0.28065396         0.28203659
## Mitoses             0.64904283        0.53924530         0.55874242
##                    Marginal_Adhes  Epit_Size Bare_Nuclei Bland_Chrom
## Uniform_Cell_Size                                                   
## Uniform_Cell_Shape                                                  
## Marginal_Adhes                                                      
## Epit_Size              0.40545223                                   
## Bare_Nuclei            0.32935171 0.41428387                        
## Bland_Chrom            0.33143294 0.38187210  0.31938514            
## Norm_Nucleoli          0.39687894 0.37107360  0.41571980  0.33439847
## Mitoses                0.58110167 0.51941670  0.66078956  0.65398911
##                    Norm_Nucleoli
## Uniform_Cell_Size               
## Uniform_Cell_Shape              
## Marginal_Adhes                  
## Epit_Size                       
## Bare_Nuclei                     
## Bland_Chrom                     
## Norm_Nucleoli                   
## Mitoses               0.56624273
# Mantel test – for similarity of matrices
# H0 - the compared matrices are random, thus they are different 
# H1 - the non-random pattern, interpreted as a similarity of the matrices
mantel_result <- mantel.test(as.matrix(wbc_cor), as.matrix(dis2))
print(mantel_result) 
## $z.stat
## [1] 8.016593
## 
## $p
## [1] 0.001
## 
## $alternative
## [1] "two.sided"
# Perform MDS using the smacof package
fit.data <- mds(dis2, ndim = 2, type = "ratio")

# Print the MDS results summary
summary(fit.data)
## 
## Configurations:
##                         D1      D2
## Clump_Thick        -0.2568  0.8022
## Uniform_Cell_Size  -0.0377 -0.0085
## Uniform_Cell_Shape -0.1165  0.0909
## Marginal_Adhes     -0.2457 -0.6323
## Epit_Size           0.3465  0.2775
## Bare_Nuclei        -0.6620  0.1468
## Bland_Chrom        -0.4914 -0.2674
## Norm_Nucleoli       0.2690 -0.4493
## Mitoses             1.1945  0.0401
## 
## 
## Stress per point (in %):
##        Clump_Thick  Uniform_Cell_Size Uniform_Cell_Shape     Marginal_Adhes 
##              11.51               2.31               3.22              16.31 
##          Epit_Size        Bare_Nuclei        Bland_Chrom      Norm_Nucleoli 
##              13.57              11.08              11.19              15.23 
##            Mitoses 
##              15.57
# Simple MDS plot
plot(fit.data, main = "MDS - Distance Matrix", xlab = "Dimension 1", ylab = "Dimension 2")

stress_values <- c()

# Loop over the number of dimensions (1 to 8)
for (k in 1:8) {
  # Apply MDS with 'k' dimensions
  fit_data <- mds(dis2, ndim = k, type = "ratio")
  
  # Extract the stress value
  stress_values <- c(stress_values, fit_data$stress)
  
  # Print the stress for the current dimension
  cat("Stress for", k, "dimensions:", fit_data$stress, "\n")
}
## Stress for 1 dimensions: 0.3997401 
## Stress for 2 dimensions: 0.1834531 
## Stress for 3 dimensions: 0.106796 
## Stress for 4 dimensions: 0.06353847 
## Stress for 5 dimensions: 0.04379878 
## Stress for 6 dimensions: 0.02140813 
## Stress for 7 dimensions: 0.006328878 
## Stress for 8 dimensions: 5.676903e-16
# Plot stress vs. number of dimensions
plot(1:8, stress_values, type = "b", 
     xlab = "Number of Dimensions", ylab = "Stress", 
     main = "Stress vs. Number of Dimensions",
     col = "blue", pch = 19)

# Generate random stress baseline
set.seed(123)
stressvec <- randomstress(n = 9, ndim = 2, nrep = 100) 
random_stress_mean <- mean(stressvec)

cat("\n=== MDS Quality Assessment ===\n")
## 
## === MDS Quality Assessment ===
cat("Random stress (baseline):", random_stress_mean, "\n")
## Random stress (baseline): 0.3115778
cat("Empirical stress (2D MDS):", fit.data$stress, "\n")
## Empirical stress (2D MDS): 0.1834531
# Quality ratio (lower is better)
stress_ratio <- fit.data$stress / random_stress_mean
cat("Stress ratio (empirical/random):", stress_ratio, "\n\n")
## Stress ratio (empirical/random): 0.5887873
# Interpretation
cat("Your 2D stress:", round(fit.data$stress, 3), "=", 
    ifelse(fit.data$stress > 0.20, "VERY POOR",
    ifelse(fit.data$stress > 0.10, "POOR",
    ifelse(fit.data$stress > 0.05, "FAIR",
    ifelse(fit.data$stress > 0.025, "GOOD", "EXCELLENT")))), "\n")
## Your 2D stress: 0.183 = POOR

The Mantel test confirms that the derived dissimilarity matrix preserves the structure of the original correlation matrix (z = 8.02, p = 0.001), so the similarity-to-dissimilarity conversion is valid.

The stress ratio of 0.589 indicates that the 2D MDS configuration is 41% better than a random arrangement, falling in the moderate fit range. While this confirms the solution captures some meaningful distance structure rather than random noise, the moderate improvement (ratio > 0.5) suggests the 2D representation is not ideal.

According to Kruskal’s benchmarks, MDS needs four to five dimensions to reach a ‘good’ stress value (stress drops below 0.05 only at five dimensions), whereas the 2-dimensional solution yields a stress of 0.183, which sits closer to ‘poor’ than to ‘fair’.

This indicates that PCA might be the superior dimensionality reduction method for this dataset. With just 2 principal components, PCA retains 74.1% of the variance, meeting the ~75% threshold I was looking for.

library(ggfortify)

wbc$Class <- factor(ifelse(true_labels == -1, "Benign", "Malignant"))
autoplot(
  pca_result,
  data = wbc,
  colour = "Class",
  loadings = TRUE,
  loadings.label = TRUE,
  loadings.label.size = 3,
  loadings.colour = "darkgreen"
) +
  scale_colour_manual(values = c("Benign" = "blue", "Malignant" = "red")) +
  labs(
    title = "PCA - Breast Cancer Features with Loadings",
    colour = "True Class"
  )

wbc$Cluster <- factor(pam_k2$clustering,
                      labels = c("Cluster 1", "Cluster 2"))
autoplot(
  pca_result,
  data = wbc,
  colour = "Cluster",
  loadings = TRUE,
  loadings.label = TRUE,
  loadings.label.size = 3,
  loadings.colour = "darkgreen"
) +
  scale_colour_manual(values = c("Cluster 1" = "blue", "Cluster 2" = "red")) +
  labs(
    title = "PCA - Breast Cancer Features with Loadings (PAM Clusters)",
    colour = "PAM Cluster"
  )

The PCA biplots show that the first two components explain approximately 74% of the total variance, with a clear separation of observations along PC1. Variables related to cell size, shape, and nuclear features contribute most strongly to this separation, while mitoses primarily influence PC2.

Coloring by true labels and PAM clusters results in nearly identical groupings. This confirms that PCA preserves the information most relevant for clustering in a low-dimensional space.
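
This visual agreement can be quantified with the adjusted Rand index; the one-liner below is an added sketch using randIndex() from flexclust (loaded at the start), where 1 indicates perfect agreement and values near 0 indicate chance-level agreement.

# Adjusted Rand index between the PAM partition and the diagnostic labels
randIndex(table(pam_k2$clustering, true_labels))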

4. Running clustering again after dimension reduction

pca_2d <- pca_result$x[, 1:2]
fviz_nbclust(pca_2d, pam, method = "silhouette", k.max = 15) +
  labs(title = "Optimal Number of Clusters - PAM after PCA")

# PAM on PCA results (k=2)
pam_pca <- pam(pca_2d, k = 2)

# Visualize PAM on PCA
fviz_cluster(pam_pca, data = pca_2d,
             ellipse.type = "norm", geom = "point",
             palette = "jco") +
  labs(title = "PAM (k=2) on PCA-reduced Data")

# Silhouette plot for PAM on PCA
fviz_silhouette(pam_pca) + 
  labs(title = "Silhouette Plot on PCA applied data")
##   cluster size ave.sil.width
## 1       1  442          0.88
## 2       2  241          0.47

# Compare PAM with true labels
cat("\n=== PAM on PCA vs True Labels ===\n")
## 
## === PAM on PCA vs True Labels ===
table(Predicted = pam_pca$clustering, Actual = true_labels)
##          Actual
## Predicted  -1   1
##         1 433   9
##         2  11 230
cat("\nPAM on PCA - Average Silhouette Width:", 
    round(pam_pca$silinfo$avg.width, 4), "\n")
## 
## PAM on PCA - Average Silhouette Width: 0.7346
# Summary comparison
cat("\n=== CLUSTERING ON REDUCED DIMENSIONS SUMMARY ===\n")
## 
## === CLUSTERING ON REDUCED DIMENSIONS SUMMARY ===
cat("PAM on PCA - Silhouette:", 
    round(pam_pca$silinfo$avg.width, 4), "\n")
## PAM on PCA - Silhouette: 0.7346

Running PAM on the PCA-reduced data produced slightly better results: the silhouette widths of both clusters increased compared with clustering on the full feature set. The optimal number of clusters is still 2, just as when clustering was run with all nine dimensions, which shows that we were able to reduce the dimensionality from 9 to 2 while preserving the cluster structure.

Lastly, comparing the cluster assignments against the true labels before and after dimension reduction, the overall agreement improved marginally, as the quick check below illustrates.
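
The figures behind this statement can be recovered from the two confusion matrices above; the sketch below (added for illustration) maps cluster 1 to benign and cluster 2 to malignant in both solutions and computes the raw agreement.

# Agreement with the true labels before and after dimension reduction
acc_full <- mean((pam_k2$clustering == 2) == (true_labels == 1))   # (435 + 224) / 683 ~ 0.965
acc_pca  <- mean((pam_pca$clustering == 2) == (true_labels == 1))  # (433 + 230) / 683 ~ 0.971
round(c(full_features = acc_full, pca_2d = acc_pca), 4)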

Conclusions

This analysis successfully demonstrated the efficacy of unsupervised learning techniques in discerning the underlying structure of breast cancer-related features.

The clustering analysis effectively partitioned the data into two distinct groups, which strongly corresponded to the known benign and malignant classes. The high purity of these clusters, with approximately 97% class dominance in each, highlights the capability of clustering algorithms to support diagnostic classification. Among the various methods evaluated, Partitioning Around Medoids (PAM) emerged as the most effective algorithm for this dataset, likely due to its robustness to outliers and its efficiency on smaller datasets.

Furthermore, dimensionality reduction via Principal Component Analysis (PCA) proved highly successful, reducing the feature space from nine dimensions to two principal components while still explaining approximately 75% of the dataset’s variance. The interpretability of these components was a significant finding; they were primarily driven by morphological features such as Uniform Cell Shape, Uniform Cell Size, and Bland Chromatin, alongside the cell division indicator Mitoses. When clustering was reapplied to this reduced two-dimensional dataset, the results were not only consistent but also yielded higher average silhouette widths, validating that the dimensionality reduction retained essential information and improved cluster cohesion. A comparative analysis with Multidimensional Scaling (MDS) confirmed PCA’s superiority for this dataset, as PCA achieved a more efficient and representative reduction in two dimensions according to Kruskal’s stress benchmarks.

In conclusion, this analysis validates that clustering is a powerful tool for identifying intrinsic data structures that align with clinical classifications. Moreover, the successful application of PCA followed by clustering underscores the value of a combined unsupervised learning workflow, which can simplify complex datasets, enhance model performance, and provide interpretable insights into the most significant features.

References

  1. Significance of Mitoses:
    https://www.ncbi.nlm.nih.gov/books/NBK9553/#:~:text=Mitoses%20are%20characteristic%20of%20malignant,dissymmetrical%20structures%20and%20atypical%20forms

  2. Trends of Cancerous vs Normal Cells:
    https://www.verywellhealth.com/cancer-cells-vs-normal-cells-2248794#:~:text=Under%20a%20microscope%2C%20normal%20cells,some%20are%20smaller%20than%20normal

  3. MDS breakdown:
    https://towardsdatascience.com/multidimensional-scaling-mds-for-dimensionality-reduction-and-data-visualization-d5252c8bc4c0/