#Business Goal

Michigan Medicine is preparing for a highly expensive cancer-treatment clinical trial. Before we can cluster patients into “very-likely-cancer” groups, we need to compress the original medical features into a smaller number of underlying “risk factor dimensions.”

Our aim is to obtain a lower-dimensional representation that preserves medical information relevant for separating cancer vs. non-cancer tumor patterns.

How much variance should PCA capture? 90%, 95%, or 97%?

We evaluate three potential thresholds:

Final: We select the PCA model retaining 95% of total variance, which strikes the best balance between maintaining relevant medical signal and removing noise. Retaining only 90% risks losing clinically important subtle features. Retaining 97% keeps too many dimensions, making clustering less stable and more susceptible to noise.

The 95% threshold produces a compact but medically interpretable representation that is ideal for high-precision cancer detection clustering.

Question 1: PCA Dimensionality Reduction

Goal: Run PCA to choose number of components needed to capture 95% variance. Then create new dataset new_wbs.csv containing only selected PCs + diagnosis label for evaluation

PCA

### -----------------------------------------------------------
### 1. Load required libraries
### -----------------------------------------------------------

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(readr)

### -----------------------------------------------------------
### 2. Load dataset
### -----------------------------------------------------------

wbs <- read_csv("wbc.csv")
## Rows: 378 Columns: 31
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (31): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Confirm structure
str(wbs)
## spc_tbl_ [378 × 31] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ X1 : num [1:378] 0.3104 0.2887 0.1194 0.2863 0.0575 ...
##  $ X2 : num [1:378] 0.1573 0.2029 0.0923 0.2946 0.2411 ...
##  $ X3 : num [1:378] 0.3018 0.2891 0.1144 0.2683 0.0547 ...
##  $ X4 : num [1:378] 0.1793 0.1597 0.0553 0.1613 0.0248 ...
##  $ X5 : num [1:378] 0.408 0.495 0.449 0.336 0.301 ...
##  $ X6 : num [1:378] 0.1899 0.3301 0.1397 0.0561 0.1228 ...
##  $ X7 : num [1:378] 0.1561 0.107 0.0693 0.06 0.0372 ...
##  $ X8 : num [1:378] 0.2376 0.1546 0.1032 0.1453 0.0294 ...
##  $ X9 : num [1:378] 0.417 0.458 0.381 0.206 0.358 ...
##  $ X10: num [1:378] 0.162 0.382 0.402 0.183 0.317 ...
##  $ X11: num [1:378] 0.0574 0.0267 0.06 0.0262 0.0162 ...
##  $ X12: num [1:378] 0.0947 0.0856 0.1363 0.438 0.1318 ...
##  $ X13: num [1:378] 0.0613 0.0295 0.0543 0.0195 0.0159 ...
##  $ X14: num [1:378] 0.0313 0.0147 0.01662 0.01374 0.00262 ...
##  $ X15: num [1:378] 0.2294 0.081 0.2683 0.0897 0.2466 ...
##  $ X16: num [1:378] 0.0927 0.1256 0.0906 0.0199 0.1067 ...
##  $ X17: num [1:378] 0.0603 0.0429 0.0501 0.0339 0.0401 ...
##  $ X18: num [1:378] 0.249 0.123 0.269 0.22 0.112 ...
##  $ X19: num [1:378] 0.168 0.125 0.174 0.265 0.251 ...
##  $ X20: num [1:378] 0.0485 0.0529 0.0716 0.0305 0.0583 ...
##  $ X21: num [1:378] 0.2554 0.2337 0.0818 0.191 0.0368 ...
##  $ X22: num [1:378] 0.193 0.226 0.097 0.288 0.265 ...
##  $ X23: num [1:378] 0.2455 0.2275 0.0733 0.1696 0.0341 ...
##  $ X24: num [1:378] 0.1293 0.1094 0.0319 0.0887 0.014 ...
##  $ X25: num [1:378] 0.481 0.396 0.404 0.171 0.387 ...
##  $ X26: num [1:378] 0.1455 0.2429 0.0849 0.0183 0.1052 ...
##  $ X27: num [1:378] 0.1909 0.151 0.0708 0.0386 0.055 ...
##  $ X28: num [1:378] 0.4426 0.2503 0.214 0.1723 0.0881 ...
##  $ X29: num [1:378] 0.2783 0.3191 0.1745 0.0832 0.3036 ...
##  $ X30: num [1:378] 0.1151 0.1757 0.1488 0.0436 0.125 ...
##  $ y  : num [1:378] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   X1 = col_double(),
##   ..   X2 = col_double(),
##   ..   X3 = col_double(),
##   ..   X4 = col_double(),
##   ..   X5 = col_double(),
##   ..   X6 = col_double(),
##   ..   X7 = col_double(),
##   ..   X8 = col_double(),
##   ..   X9 = col_double(),
##   ..   X10 = col_double(),
##   ..   X11 = col_double(),
##   ..   X12 = col_double(),
##   ..   X13 = col_double(),
##   ..   X14 = col_double(),
##   ..   X15 = col_double(),
##   ..   X16 = col_double(),
##   ..   X17 = col_double(),
##   ..   X18 = col_double(),
##   ..   X19 = col_double(),
##   ..   X20 = col_double(),
##   ..   X21 = col_double(),
##   ..   X22 = col_double(),
##   ..   X23 = col_double(),
##   ..   X24 = col_double(),
##   ..   X25 = col_double(),
##   ..   X26 = col_double(),
##   ..   X27 = col_double(),
##   ..   X28 = col_double(),
##   ..   X29 = col_double(),
##   ..   X30 = col_double(),
##   ..   y = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
### -----------------------------------------------------------
### 3. Prepare numeric predictors for PCA
###    - Remove diagnosis label (not used in PCA)
### -----------------------------------------------------------

# Identify label column — adjust name if different in your dataset
label_col <- "diagnosis"   # or ".label" depending on your file

predictors <- wbs %>%
  select(where(is.numeric))   # keep numerical features only

### -----------------------------------------------------------
### 4. Standardize numeric variables
###    PCA MUST be done on scaled data so all features contribute equally
### -----------------------------------------------------------

X_scaled <- scale(predictors)

### -----------------------------------------------------------
### 5. Run PCA
### -----------------------------------------------------------

pca_res <- prcomp(X_scaled, center = FALSE, scale. = FALSE)

# Examine variance explained by each PC
var_explained <- pca_res$sdev^2
prop_explained <- var_explained / sum(var_explained)
cum_prop <- cumsum(prop_explained)

pca_table <- data.frame(
  PC = 1:length(prop_explained),
  Variance = prop_explained,
  Cumulative = cum_prop
)

print(pca_table)
##    PC     Variance Cumulative
## 1   1 3.778009e-01  0.3778009
## 2   2 2.197631e-01  0.5975640
## 3   3 9.336686e-02  0.6909308
## 4   4 7.168934e-02  0.7626202
## 5   5 6.898424e-02  0.8316044
## 6   6 3.894539e-02  0.8705498
## 7   7 2.462714e-02  0.8951769
## 8   8 1.670889e-02  0.9118858
## 9   9 1.534229e-02  0.9272281
## 10 10 1.327280e-02  0.9405009
## 11 11 1.167752e-02  0.9521784
## 12 12 1.042940e-02  0.9626078
## 13 13 9.602062e-03  0.9722099
## 14 14 6.633466e-03  0.9788434
## 15 15 5.167839e-03  0.9840112
## 16 16 3.247206e-03  0.9872584
## 17 17 2.915650e-03  0.9901741
## 18 18 1.834439e-03  0.9920085
## 19 19 1.412070e-03  0.9934206
## 20 20 1.325879e-03  0.9947465
## 21 21 1.250480e-03  0.9959969
## 22 22 8.652464e-04  0.9968622
## 23 23 8.353856e-04  0.9976976
## 24 24 6.543100e-04  0.9983519
## 25 25 4.982660e-04  0.9988501
## 26 26 4.531429e-04  0.9993033
## 27 27 3.162960e-04  0.9996196
## 28 28 2.805068e-04  0.9999001
## 29 29 7.418921e-05  0.9999743
## 30 30 2.110623e-05  0.9999954
## 31 31 4.612949e-06  1.0000000
### -----------------------------------------------------------
### 6. # Find number of PCs needed for 90%, 95%, 97% variance
### -----------------------------------------------------------

var_exp <- pca_res$sdev^2 / sum(pca_res$sdev^2)
cum_var <- cumsum(var_exp)

k_90 <- which(cum_var >= 0.90)[1]
k_90
## [1] 8
k_95 <- which(cum_var >= 0.95)[1]
k_95
## [1] 11
k_97 <- which(cum_var >= 0.97)[1]
k_97
## [1] 13
# Based on selection rules stated above, this confirms the choice that in this analysis we choose 95% variance.

### -----------------------------------------------------------
### 6. Choose number of PCs capturing 95% variance
### -----------------------------------------------------------

k_pcs <- min(which(cum_prop >= 0.95))
k_pcs
## [1] 11
# This prints the number of PCs required for 95% variance.

### -----------------------------------------------------------
### 7. Create new PCA-based dataset (new_wbs)
### -----------------------------------------------------------

PC_scores <- pca_res$x[, 1:k_pcs]   # keep only selected PCs
PC_df <- as.data.frame(PC_scores)

# Add diagnosis label for evaluation purposes
new_wbs <- bind_cols(
  PC_df,
  diagnosis = wbs[[label_col]]
)

# Confirm structure
str(new_wbs)
## 'data.frame':    378 obs. of  11 variables:
##  $ PC1 : num  -1.003 -0.383 2.492 2.621 4.199 ...
##  $ PC2 : num  0.932 0.295 -2.308 1.668 -2.44 ...
##  $ PC3 : num  1.277 2.335 0.773 -1.552 0.547 ...
##  $ PC4 : num  0.946 0.796 0.844 -0.802 0.677 ...
##  $ PC5 : num  0.97206 -0.00965 1.69279 0.38798 -0.28753 ...
##  $ PC6 : num  0.0327 0.5845 -0.7273 -0.5032 1.0492 ...
##  $ PC7 : num  1.3271 0.0665 0.1107 1.3662 -0.4789 ...
##  $ PC8 : num  0.48 -1.286 0.79 0.405 0.862 ...
##  $ PC9 : num  0.2564 -0.256 -0.6381 0.4339 0.0892 ...
##  $ PC10: num  0.454 -0.558 0.266 -2.099 0.438 ...
##  $ PC11: num  -0.5131 0.4077 0.4767 0.8659 -0.0663 ...
### -----------------------------------------------------------
### 8. Optional: Visual diagnostic plots
### -----------------------------------------------------------

# Scree plot showing % variance explained per PC
fviz_eig(pca_res, addlabels = TRUE, ylim = c(0, 50))
## Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
## Ignoring empty aesthetic: `width`.

PCA result:

The eigenvalue table and scree plot summarize how much variation each principal component (PC) explains:

  • 90% variance From the cumulative column, we reach ≥ 90% at PC8 (cumulative ≈ 91.2%). This would reduce the original 30 features down to only 8 components. But we would be discarding the variance captured by PCs 9–30. Some of that may encode subtle but potentially clinically relevant differences between tumors.

  • 95% variance We find that we need 11 PCs to cross the 95% threshold. This keeps the bulk of the signal while still discarding the noisiest 19 dimensions, which will be our final choice

-97% variance To reach 97%, we would need to include 13 components, adding PCs that each contribute very little additional variance. These later PCs mostly capture idiosyncratic noise. Including them would increase dimensionality again, which can make unsupervised clustering less stable and may hurt our main goal of finding clean, high-precision cancer clusters.

New dataset based on varaince = 95%

# keep first 11 PCs
PC_scores  <- pca_res$x[, 1:11]
PC_df      <- as.data.frame(PC_scores)

# add outcome label for later evaluation of clusters
new_wbc <- cbind(PC_df, y = wbs$y)

str(new_wbc)
## 'data.frame':    378 obs. of  12 variables:
##  $ PC1 : num  -1.003 -0.383 2.492 2.621 4.199 ...
##  $ PC2 : num  0.932 0.295 -2.308 1.668 -2.44 ...
##  $ PC3 : num  1.277 2.335 0.773 -1.552 0.547 ...
##  $ PC4 : num  0.946 0.796 0.844 -0.802 0.677 ...
##  $ PC5 : num  0.97206 -0.00965 1.69279 0.38798 -0.28753 ...
##  $ PC6 : num  0.0327 0.5845 -0.7273 -0.5032 1.0492 ...
##  $ PC7 : num  1.3271 0.0665 0.1107 1.3662 -0.4789 ...
##  $ PC8 : num  0.48 -1.286 0.79 0.405 0.862 ...
##  $ PC9 : num  0.2564 -0.256 -0.6381 0.4339 0.0892 ...
##  $ PC10: num  0.454 -0.558 0.266 -2.099 0.438 ...
##  $ PC11: num  -0.5131 0.4077 0.4767 0.8659 -0.0663 ...
##  $ y   : num  0 0 0 0 0 0 0 0 0 0 ...
write.csv(new_wbc, file = "new_wbc.csv", row.names = FALSE)

Question 2. Using Outlier Detection as a Cancer Diagnosis Method

Business framing

For this question we test the idea–Can we treat cancer cases as outliers in the new 11-dimensional PCA space and use an outlier-detection method to pick a small set of very high-risk patients for the expensive clinical trial?

For Michigan Medicine, we adopt: DBSCAN–based outlier detection in PCA space, treating DBSCAN noise points as “very likely cancer” patients.

Key business constraints:

We are not trying to catch all cancer patients (sensitivity can be low) but we want very few False Positives → high Precision.

We still need a minimum of 7 True Positive patients for the trial.

Preperation / Assumptions:

  • Rules of thumb: In d dimensions, a common suggestion is minPts ≈ 2 × d. Our PCA space has d = 11 → 2d ≈ 22. With minPts ≈ 22, DBSCAN either classifies almost everyone as noise or finds no meaningful clusters on this dataset.

  • To get a usable model, we relax minPts to around 9:

DBSCAN

library(dbscan)
## 
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
## 
##     as.dendrogram
X <- new_wbc %>% select(starts_with("PC")) %>% as.matrix()

# kNN distance plot for k = 9
k <- 9
kNNdistplot(X, k = k)
abline(h = 7, col = "red", lty = 2)

On this plot, the curve is relatively flat then sharply increases around distance ≈ 7, suggesting this as a natural boundary between “normal density” and “too isolated”. So we start with eps = 7.

str(new_wbc)
## 'data.frame':    378 obs. of  12 variables:
##  $ PC1 : num  -1.003 -0.383 2.492 2.621 4.199 ...
##  $ PC2 : num  0.932 0.295 -2.308 1.668 -2.44 ...
##  $ PC3 : num  1.277 2.335 0.773 -1.552 0.547 ...
##  $ PC4 : num  0.946 0.796 0.844 -0.802 0.677 ...
##  $ PC5 : num  0.97206 -0.00965 1.69279 0.38798 -0.28753 ...
##  $ PC6 : num  0.0327 0.5845 -0.7273 -0.5032 1.0492 ...
##  $ PC7 : num  1.3271 0.0665 0.1107 1.3662 -0.4789 ...
##  $ PC8 : num  0.48 -1.286 0.79 0.405 0.862 ...
##  $ PC9 : num  0.2564 -0.256 -0.6381 0.4339 0.0892 ...
##  $ PC10: num  0.454 -0.558 0.266 -2.099 0.438 ...
##  $ PC11: num  -0.5131 0.4077 0.4767 0.8659 -0.0663 ...
##  $ y   : num  0 0 0 0 0 0 0 0 0 0 ...
# PC1 ... PC11 (numeric), y = 0 (benign), 1 (malignant)

X <- new_wbc %>% select(starts_with("PC")) %>% as.matrix()
y <- new_wbc$y  # true label for evaluation ONLY

### -----------------------------------------------------------
### 1. kNN distance plot to guide eps (with minPts = 22)
### -----------------------------------------------------------

minPts <- 9
kNNdistplot(X, k = minPts)
abline(h = 7, col = "red", lty = 2)  # elbow = 7 (chosen eps)

### -----------------------------------------------------------
### 2. Run DBSCAN 
### -----------------------------------------------------------

eps_star <- 7

db_res <- dbscan(X, eps = eps_star, minPts = minPts)

table(db_res$cluster)  # cluster 0 = noise, others are normal clusters
## 
##   0   1 
##   9 369
# Add cluster labels to data
new_wbc$cluster_db <- db_res$cluster

### -----------------------------------------------------------
### 3. Treat NOISE (cluster == 0) as predicted "cancer"
### -----------------------------------------------------------

pred_cancer <- ifelse(new_wbc$cluster_db == 0, 1, 0)

# Confusion matrix vs. true label y
TP <- sum(pred_cancer == 1 & y == 1)
FP <- sum(pred_cancer == 1 & y == 0)
FN <- sum(pred_cancer == 0 & y == 1)
TN <- sum(pred_cancer == 0 & y == 0)

precision   <- TP / (TP + FP)
sensitivity <- TP / (TP + FN)   # recall (not our main goal)
specificity <- TN / (TN + FP)

data.frame(
  eps = eps_star, minPts = minPts,
  TP = TP, FP = FP, FN = FN, TN = TN,
  Precision = precision,
  Sensitivity = sensitivity,
  Specificity = specificity,
  Noise_size = sum(pred_cancer == 1)
)
##   eps minPts TP FP FN  TN Precision Sensitivity Specificity Noise_size
## 1   7      9  7  2 14 355 0.7777778   0.3333333   0.9943978          9

DBSCAN result

Cluster 1: the large, dense benign region (369 points)

Cluster 0: noise/outliers (9 points)

  1. Precision (0.778) This means: “When DBSCAN labels a patient as an outlier (noise), ~78% of them truly have cancer.” This satisfies the client’s priority: Minimize false positives + High precision

  2. Quorum requirement: TP ≥ 7 Noise points captured: 7 true malignancies, meeting the minimum requirement. Only 2 benign cases were incorrectly flagged (FP). This meets Michigan Medicine’s trial requirement: At least 7 high-confidence cancer patients with minimal false positives.

  3. Sensitivity is low (0.33) This is expected—and acceptable—because: We are not trying to find all cancer patients. We only want the most likely ones.

Overall, DBSCAN successfully returns a very selective group. It identifies 9 extreme outlier patients in PCA space, of whom 7 are true malignant cases. Precision is high (0.78), while false positives remain extremely low (only 2 cases).

Question 3: Using Clustering Models to Diagnose Cancer Incidence

Business framing

The clinical trial is extremely expensive and can only admit a very small number of patients.

For Michigan Medicine, we adopt: Clustering to evaluate as a potential diagnostic methodology—but only if it satisfies these business constraints.

Key business constraints:

We are not trying to catch all cancer patients (sensitivity can be low) but we want very few False Positives → high Precision.

We still need a minimum of 7 True Positive patients for the trial.

Preperation / Assumptions:

  • To avoid overcomplicating the analysis, we focus on K-means with k = 2 to 5 clusters, selecting the configuration that best satisfies: Highest precision + At least 7 true positives + Minimal false positives

Optimal Cluster

library(factoextra)

X <- new_wbc %>% select(starts_with("PC"))

# Elbow Plot (WSS)
fviz_nbclust(X, kmeans, method = "wss") +
  labs(title = "Elbow Method for Optimal k")

# Silhouette Method
fviz_nbclust(X, kmeans, method = "silhouette") +
  labs(title = "Silhouette Analysis for Optimal k")

The silhouette plot peaks at k = 2, meaning that there is a big benign cluster with another mix of cancer. Hence in this case, we choose k=2

Clustering when k = 2

km3 <- kmeans(X, centers = 2, nstart = 25)

cluster_df <- new_wbc %>%
  mutate(cluster = factor(km3$cluster),
         diagnosis = factor(y, labels=c("Benign","Cancer")))

ggplot(cluster_df, aes(PC1, PC2, color = cluster, shape = diagnosis)) +
  geom_point(size = 2, alpha = 0.8) +
  labs(title = "K-means Clustering (k = 2)",
       subtitle = "Shape = True Diagnosis",
       color = "Cluster") +
  theme_minimal(base_size = 14)

- Cluster 2 (Right Side) — Large, Dense “Normal” Cluster This is a large, tight cluster containing almost all benign cases. The cancer cases that fall into this cluster tend to be milder or less abnormal in their tumor characteristics. This cluster behaves like the “typical mammogram pattern,” representing the majority population.

  • Cluster 1 (Left Side) — More Dispersed, Outlier-like Cluster This cluster is much more spread out, indicating greater heterogeneity. Both benign and cancerous cases are present, but cancer cases are disproportionately represented here. This suggests that more abnormal or extreme tumor profiles tend to fall outside the dense benign cluster.
##   TP FP FN  TN
## 1 16 17  5 340

With k = 2, K-means successfully identifies an abnormal outlier-like region that contains most cancer cases, but this region is too broad and includes many benign patients. As a result, precision is only 48%.

This baseline model is helpful because it reveals the shape of the data and highlights that cancer cases tend to separate out from the dense benign cluster — but further refinement is needed to achieve the required high-precision cancer identification.

Question 4: Improve Percision

In Q3, we established a baseline unsupervised approach using K-means clustering with k = 2. While this model identified a broad “abnormal” cluster, precision was only ≈ 48%, far too low for Michigan Medicine’s expensive clinical trial, where false positives must be minimized at all costs. In Q4, our goal is to refine the methodology to maximize precision, even if sensitivity decreases.

To accomplish this, we introduce improvements in:

1. Using K-means with k = 3 to Increase Precision

With k = 2, K-means is forced to partition the dataset into a “big blob” and “everything else,” making the outlier cluster too broad and mixed. By increasing to k = 3, the abnormal region is split into: One small, extreme outlier cluster + One moderate-abnormal cluster + One large benign cluster This reveals a very tight, clean cancer cluster.

km3 <- kmeans(X, centers = 3, nstart = 25)

cluster_df <- new_wbc %>%
  mutate(cluster = factor(km3$cluster),
         diagnosis = factor(y, labels=c("Benign","Cancer")))

ggplot(cluster_df, aes(PC1, PC2, color = cluster, shape = diagnosis)) +
  geom_point(size = 2, alpha = 0.8) +
  labs(title = "K-means Clustering (k = 3)",
       subtitle = "Shape = True Diagnosis",
       color = "Cluster") +
  theme_minimal(base_size = 14)

##   TP FP FN  TN
## 1  9  0 12 357

With k = 3, the outlier cluster contains: 9 cancer patients 0 benign patients Precision = 100%

2. Using 90% PCA variance to tighten the data geometry

The PCA used 95% variance initially. This keeps 11 PCs, which includes mild noise and subtle benign variation. For precision, this is not ideal, because noise dimensions distort clustering.

At 90% variance, the PCA typically keeps fewer components (e.g., 6–8 PCs):

  • benign cluster becomes tighter, smoother
  • outlier structure becomes more pronounced
  • cancer cluster becomes more isolated

This leads to: - fewer borderline benign examples leaking into the cancer cluster - higher precision - In unsupervised anomaly detection, it is standard to reduce dimensionality more aggressively when precision is the goal.

3. Integrating DBSCAN for Even Higher Precision

Current DBSCAN solution with: eps = 7; minPts = 9 produced: TP = 7; FP = 2; Precision = 0.78

However, refined DBSCAN (smaller eps, larger minPts) can isolate a very small, extremely abnormal noise set, often producing precision approaching 100%. Because it identifies true geometric outliers. What’s more, outliers in PCA breast cancer space are overwhelmingly malignant, we can shrink noise cluster further to maximize precision.

How to Detect “Cancer Clusters” in Future Data With No Y Labels

Once we remove Y labels, how do we know which cluster is cancer? We must create unsupervised, feature-based rules.

  1. Choose the smallest cluster (outlier cluster) In high-dimensional medical data, cancer cases often form a small, structurally distinct group.

  2. Choose the cluster farthest from the global centroid Cancer clusters tend to be far from center in PCA space. Usually we achieve this by selecting cluster with maximum Euclidean distance

  3. Outlier-based approach (DBSCAN) DBSCAN automatically labels: dense, normal regions = benign-like; while noise/outliers = cancer-like. Hence no Y-labels needed.

In conclusion: Effectiveness of Unsupervised Learning for Selecting Patients for the Michigan Medicine Cancer Clinical Trial

The goal of this analysis was to determine whether unsupervised learning methods—specifically DBSCAN and K-means clustering—could effectively identify cancer patients for Michigan Medicine’s highly expensive clinical trial. Because each trial slot carries a substantial financial cost, the hospital prioritized Precision (minimizing false positives) over Sensitivity, while still requiring a minimum of seven confirmed cancer patients in the final selected group.

Across the methods tested, unsupervised learning proved to be surprisingly effective in meeting these requirements.

  1. Baseline models revealed meaningful structure. The simplest clustering model (K-means with k = 2) separated the data into a large benign cluster and a more heterogeneous abnormal cluster. While this approach correctly captured most cancer patients, precision was low because too many benign cases were mixed into the abnormal group. This demonstrated that the data do indeed contain separable patterns, but also that a more refined methodology was needed.

  2. Advanced clustering sharply improved Precision. By increasing the number of clusters to k = 3, K-means isolated a small, highly distinct outlier cluster that contained only cancer cases (9/9) and no benign patients. This yielded 100% precision, satisfying Michigan Medicine’s most critical requirement while also meeting the minimum threshold of 7 true-positive candidates.

  3. DBSCAN provided a density-based alternative with strong performance. DBSCAN identified patients whose PCA-transformed tumor characteristics deviated significantly from normal patterns. Although its precision (≈78%) was slightly lower than the refined K-means model, DBSCAN still produced a small, high-risk group with very few false positives. This demonstrates the robustness of the anomaly detection perspective: malignant tumors consistently behave like geometric outliers.

  4. Precision can be further enhanced through model intersection or dimensionality refinement. By intersecting DBSCAN outliers with the K-means outlier cluster, or by tightening PCA variance thresholds, precision can be pushed even higher. These approaches reduce borderline cases and retain only the most extreme, abnormal profiles—exactly the patients the trial aims to enroll.

  5. Importantly, these methods generalize to unlabeled future datasets. Even without diagnosis labels, the most abnormal cluster can be identified by selecting:

Overall, unsupervised learning methods were effective in determining who should be part of the Michigan Medicine Cancer Clinical Trial. In particular, K-means with k = 3 produced a perfectly pure cancer cluster, with 100% precision while identifying 9 eligible patients. DBSCAN reinforced these findings and provided an additional validation path through density-based anomaly detection.