library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(cluster) # clustering & silhouette
library(factoextra) # PCA & clustering visualization
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(dbscan) # kNN distance (for outliers)
## 
## Attaching package: 'dbscan'
## 
## The following object is masked from 'package:stats':
## 
##     as.dendrogram

1. Business Context & Objectives

Michigan Medicine is exploring whether routinely collected personal health indicators (e.g., lab results, vitals, risk scores) can be used as a screening tool to flag patients who may have cancer.

In this project, we treat the problem as unsupervised learning: we do not use the cancer outcome label y for model building, and we use y only afterwards to evaluate how well our unsupervised methods identify cancer.

Use three classes of methods:

  1. Principal Component Analysis (PCA) to reduce dimensionality and denoise the predictors.

  2. Clustering (on PCA scores) to see whether there are data-driven patient segments with different cancer risk.

  3. Outlier detection methods to test the idea that “strongly abnormal health profiles” are more likely to correspond to cancer.

Evaluation focus: Because missing a cancer case is clinically much more costly than a false alarm, we prioritize Sensitivity (True Positive Rate) and are willing to tolerate more False Positives.

2. Data

#----------------------------------------------------------

# 2.1 Load data

#----------------------------------------------------------

# Change file name if needed

raw <- read.csv("/Users/wenyuhan/Downloads/wbc.csv")

glimpse(raw)
## Rows: 378
## Columns: 31
## $ X1  <dbl> 0.31042643, 0.28865540, 0.11940934, 0.28628899, 0.05750390, 0.2399…
## $ X2  <dbl> 0.15725397, 0.20290835, 0.09232330, 0.29455529, 0.24112276, 0.1663…
## $ X3  <dbl> 0.30177597, 0.28912998, 0.11436666, 0.26826066, 0.05473015, 0.2366…
## $ X4  <dbl> 0.17934252, 0.15970308, 0.05531283, 0.16131495, 0.02477200, 0.1297…
## $ X5  <dbl> 0.4076916, 0.4953507, 0.4493094, 0.3358310, 0.3012549, 0.4556288, …
## $ X6  <dbl> 0.18989633, 0.33010245, 0.13968468, 0.05607018, 0.12284522, 0.2194…
## $ X7  <dbl> 0.156138707, 0.107029053, 0.069259606, 0.060028116, 0.037207123, 0…
## $ X8  <dbl> 0.23762425, 0.15457256, 0.10318091, 0.14527833, 0.02940855, 0.1366…
## $ X9  <dbl> 0.4166667, 0.4580808, 0.3813131, 0.2055556, 0.3580808, 0.3106061, …
## $ X10 <dbl> 0.1621735, 0.3822662, 0.4020640, 0.1826032, 0.3173968, 0.2205139, …
## $ X11 <dbl> 0.05736013, 0.02668839, 0.06003983, 0.02621764, 0.01622307, 0.0550…
## $ X12 <dbl> 0.09467822, 0.08563914, 0.13627122, 0.43798621, 0.13182903, 0.0815…
## $ X13 <dbl> 0.06130142, 0.02949630, 0.05428073, 0.01946002, 0.01587900, 0.0514…
## $ X14 <dbl> 0.031300080, 0.014695610, 0.016619412, 0.013743047, 0.002620480, 0…
## $ X15 <dbl> 0.22942516, 0.08104157, 0.26831424, 0.08971003, 0.24662610, 0.1283…
## $ X16 <dbl> 0.09273891, 0.12563463, 0.09063598, 0.01988013, 0.10670832, 0.0902…
## $ X17 <dbl> 0.060277778, 0.042878788, 0.050126263, 0.033914141, 0.040101010, 0…
## $ X18 <dbl> 0.2491002, 0.1229400, 0.2691798, 0.2204963, 0.1120856, 0.1072173, …
## $ X19 <dbl> 0.16769854, 0.12520403, 0.17431193, 0.26492936, 0.25128046, 0.0900…
## $ X20 <dbl> 0.04854691, 0.05286541, 0.07162500, 0.03047828, 0.05828946, 0.0527…
## $ X21 <dbl> 0.25542512, 0.23372465, 0.08182142, 0.19103522, 0.03678406, 0.2073…
## $ X22 <dbl> 0.19296375, 0.22574627, 0.09701493, 0.28757996, 0.26492537, 0.2313…
## $ X23 <dbl> 0.24548035, 0.22750137, 0.07331042, 0.16958016, 0.03411524, 0.1965…
## $ X24 <dbl> 0.12927645, 0.10944259, 0.03187672, 0.08865022, 0.01400904, 0.0976…
## $ X25 <dbl> 0.4809483, 0.3964208, 0.4043452, 0.1706399, 0.3865152, 0.5166083, …
## $ X26 <dbl> 0.14554045, 0.24285201, 0.08490264, 0.01833687, 0.10517992, 0.1826…
## $ X27 <dbl> 0.190894569, 0.150958466, 0.070822684, 0.038602236, 0.054952077, 0…
## $ X28 <dbl> 0.44261168, 0.25027491, 0.21398625, 0.17226804, 0.08810997, 0.2250…
## $ X29 <dbl> 0.27833629, 0.31914055, 0.17445299, 0.08318549, 0.30356791, 0.2329…
## $ X30 <dbl> 0.11511216, 0.17571822, 0.14882592, 0.04361800, 0.12495081, 0.1834…
## $ y   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
summary(raw$y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05556 0.00000 1.00000
table(raw$y)
## 
##   0   1 
## 357  21
prop.table(table(raw$y))
## 
##          0          1 
## 0.94444444 0.05555556

We have:

  • Predictors: X1–X30 (continuous health indicators).
  • Outcome: y (0 = no cancer, 1 = cancer), used only for evaluation.

#----------------------------------------------------------

# 2.2 Separate predictors and outcome, scale predictors

#----------------------------------------------------------

# Outcome (kept aside for evaluation)

y <- raw$y

# Predictor matrix (all X1...X30)

X <- raw %>% select(starts_with("X"))

# Scale predictors (mean 0, sd 1) – important for PCA & distance-based methods

X_scaled <- scale(X)

# Basic EDA: distributions of selected variables

X %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Distribution of Health Indicators",
       x = "Value", y = "Count")

These indicators are roughly continuous and on different scales, which justifies the use of scaling and PCA.

3. PCA to Reduce Dimensionality

#----------------------------------------------------------

# 3.1 Fit PCA on scaled predictors

#----------------------------------------------------------
pca_fit <- prcomp(X_scaled, center = FALSE, scale. = FALSE) # X_scaled is already centered and scaled

# Proportion of variance explained

pve <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
cumulative_pve <- cumsum(pve)

pve_tbl <- tibble(
  PC = seq_along(pve),
  PVE = pve,
  CumPVE = cumulative_pve
)

pve_tbl %>% head(10)
## # A tibble: 10 × 3
##       PC    PVE CumPVE
##    <int>  <dbl>  <dbl>
##  1     1 0.376   0.376
##  2     2 0.226   0.602
##  3     3 0.0964  0.699
##  4     4 0.0733  0.772
##  5     5 0.0678  0.840
##  6     6 0.0402  0.880
##  7     7 0.0248  0.905
##  8     8 0.0160  0.921
##  9     9 0.0146  0.936
## 10    10 0.0137  0.949
# Scree plot

ggplot(pve_tbl, aes(x = PC, y = PVE)) +
  geom_line() +
  geom_point() +
  labs(title = "Scree Plot: Variance Explained by Each Principal Component",
       x = "Principal Component", y = "Proportion of Variance Explained")

ggplot(pve_tbl, aes(x = PC, y = CumPVE)) +
  geom_line() +
  geom_point() +
  geom_hline(yintercept = 0.8, linetype = "dashed", color = "red") +
  annotate("text", x = 5, y = 0.82, label = "80% variance", color = "red") +
  labs(title = "Cumulative Variance Explained by Principal Components",
       x = "Number of PCs", y = "Cumulative PVE")

The scree plot shows a steep drop after the first few components, meaning most variability is captured in the first 4–6 PCs. This indicates the original features are highly correlated and can be summarized with fewer dimensions.

The first 5 PCs explain ~84% of the variance (the first point at which the cumulative PVE crosses 80%), so reducing the data to 5 components preserves most of the information while reducing noise.

#----------------------------------------------------------

# 3.2 Select number of PCs that explain ~80% of the variance

#----------------------------------------------------------
num_pcs <- min(which(cumulative_pve >= 0.80))
num_pcs
## [1] 5
# PCA scores for selected components

pc_scores <- pca_fit$x[, 1:num_pcs]

head(pc_scores)
##             PC1        PC2        PC3        PC4         PC5
## [1,] -1.0599829  0.9740446 -1.2644291 -1.2958504 -0.36126075
## [2,] -0.4345808  0.3184465 -2.3336944 -0.7409305  0.43554949
## [3,]  2.4491849 -2.3563354 -0.7760504 -1.4233094 -1.37218377
## [4,]  2.6410221  1.6209466  1.5580247  0.5806082 -0.73798578
## [5,]  4.2039359 -2.5407498 -0.5769904 -0.4292987  0.28307268
## [6,]  0.2909962  0.3204955 -1.9265758 -0.4806329  0.06541547
loadings <- pca_fit$rotation[, 1:2]   # PC1 and PC2 loadings
round(loadings, 3)[1:20, ]            # show first 20 variables for readability
##        PC1    PC2
## X1  -0.183  0.287
## X2  -0.032  0.029
## X3  -0.199  0.270
## X4  -0.196  0.277
## X5  -0.121 -0.162
## X6  -0.259 -0.126
## X7  -0.270 -0.086
## X8  -0.277  0.023
## X9  -0.120 -0.159
## X10 -0.110 -0.316
## X11 -0.193  0.002
## X12  0.011 -0.126
## X13 -0.211 -0.001
## X14 -0.225  0.101
## X15  0.006 -0.265
## X16 -0.192 -0.207
## X17 -0.165 -0.191
## X18 -0.194 -0.160
## X19 -0.017 -0.196
## X20 -0.130 -0.254

Interpretation: A relatively small number of synthetic dimensions (principal components) capture most of the variation across patients. This suggests we can work with a compressed representation of patient health rather than all 30 raw indicators, which simplifies clustering and outlier detection while reducing noise.

PC1 has negative and similarly sized loadings across almost all variables (e.g., X6–X8, X13, X14). This indicates that PC1 represents a general abnormality or overall health-deviation dimension: because the loadings share the same (negative) sign, patients whose indicators are jointly elevated receive strongly negative PC1 scores, so the magnitude of a patient's PC1 score reflects how far multiple health indicators deviate from the population center at the same time.

PC2 shows a clear split between two sets of variables: X1, X3, and X4 have strong positive loadings, while X10, X15, X16, X17, X19, and X20 have strong negative loadings. This suggests that PC2 captures a contrast between two different patterns of abnormality: one driven by variables such as X1–X4, and another driven by variables such as X10, X15, X20, and related markers.
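To make these loading patterns easier to scan, the rotation matrix can be ranked by absolute loading on each component. The helper below is a small illustrative sketch (top_loadings is a name introduced here, not part of the analysis above):

# Rank variables by absolute loading on a given PC (sketch)
top_loadings <- function(fit, pc = 1, n = 8) {
  load_vec <- fit$rotation[, pc]
  sort_idx <- order(abs(load_vec), decreasing = TRUE)[1:n]
  tibble(variable = names(load_vec)[sort_idx],
         loading  = round(load_vec[sort_idx], 3))
}

top_loadings(pca_fit, pc = 1)  # strongest contributors to PC1
top_loadings(pca_fit, pc = 2)  # strongest contributors to PC2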

4. Clustering Patients in PCA Space

4.1 Choosing the Number of Clusters

We cluster in the reduced PC space (pc_scores).

#----------------------------------------------------------

# 4.1 Choose k using within-cluster sum of squares (Elbow)

#----------------------------------------------------------
set.seed(123)

wss <- map_dbl(2:6, function(k){
  kmeans(pc_scores, centers = k, nstart = 20)$tot.withinss
})

elbow_tbl <- tibble(
  k = 2:6,
  wss = wss
)

ggplot(elbow_tbl, aes(x = k, y = wss)) +
  geom_line() +
  geom_point() +
  labs(title = "Elbow Plot for K-Means Clustering",
       x = "Number of Clusters (k)",
       y = "Total Within-Cluster Sum of Squares")

#----------------------------------------------------------

# 4.2 Silhouette analysis for k = 2,...,6

#----------------------------------------------------------
sil_scores <- map_dbl(2:6, function(k){
  km <- kmeans(pc_scores, centers = k, nstart = 20)
  ss <- silhouette(km$cluster, dist(pc_scores))
  mean(ss[, "sil_width"])
})

sil_tbl <- tibble(
  k = 2:6,
  avg_sil = sil_scores
)

ggplot(sil_tbl, aes(x = k, y = avg_sil)) +
  geom_line() +
  geom_point() +
  labs(title = "Average Silhouette Width vs. Number of Clusters",
       x = "Number of Clusters (k)", y = "Average Silhouette Width")

# pick k maximizing silhouette
best_k <- sil_tbl$k[which.max(sil_tbl$avg_sil)]
best_k
## [1] 2

Average silhouette is maximized at k = 2 and drops sharply afterwards, so we choose two clusters.
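As a complementary check, the per-observation silhouette profile for the chosen k can be drawn with factoextra's fviz_silhouette. This is an optional visualization sketch using the objects already defined above:

# Silhouette profile for k = 2 (uses cluster::silhouette + factoextra)
set.seed(123)
km_check <- kmeans(pc_scores, centers = 2, nstart = 20)
sil_obj  <- silhouette(km_check$cluster, dist(pc_scores))
fviz_silhouette(sil_obj)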

4.2 Fit K-Means with Selected k

km_final <- kmeans(pc_scores, centers = best_k, nstart = 50)
clusters <- km_final$cluster

cluster_df <- as_tibble(pc_scores) %>%
  mutate(
    cluster = factor(clusters),
    y = y
  )

# 2D visualization with first two PCs

ggplot(cluster_df, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(alpha = 0.7) +
  labs(
    title = paste("Patient Clusters in PCA Space (k =", best_k, ")"),
    x = "PC1", y = "PC2", color = "Cluster"
  )

#----------------------------------------------------------

# 4.3 Cluster profiles on original variables

#----------------------------------------------------------
cluster_profiles <- cluster_df %>%
  bind_cols(as_tibble(X)) %>%
  group_by(cluster) %>%
  summarise(across(starts_with("X"), mean), .groups = "drop")

cluster_profiles
## # A tibble: 2 × 31
##   cluster    X1    X2    X3    X4    X5    X6     X7    X8    X9   X10    X11
##   <fct>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 1       0.329 0.328 0.337 0.211 0.473 0.470 0.423  0.376 0.506 0.495 0.145 
## 2 2       0.247 0.282 0.239 0.137 0.356 0.172 0.0931 0.122 0.336 0.252 0.0599
## # ℹ 19 more variables: X12 <dbl>, X13 <dbl>, X14 <dbl>, X15 <dbl>, X16 <dbl>,
## #   X17 <dbl>, X18 <dbl>, X19 <dbl>, X20 <dbl>, X21 <dbl>, X22 <dbl>,
## #   X23 <dbl>, X24 <dbl>, X25 <dbl>, X26 <dbl>, X27 <dbl>, X28 <dbl>,
## #   X29 <dbl>, X30 <dbl>
# Visualize cluster profiles as a heatmap
cluster_profiles_long <- cluster_profiles %>%
  pivot_longer(cols = starts_with("X"), names_to = "variable", values_to = "mean_val")

ggplot(cluster_profiles_long, aes(x = variable, y = cluster, fill = mean_val)) +
  geom_tile() +
  coord_flip() +
  labs(title = "Cluster Profiles on Original Health Indicators",
       x = "Health Indicator", y = "Cluster") +
  scale_fill_viridis_c()

Interpretation: The algorithm finds a small number of patient segments that share similar health profiles. Differences across clusters in specific indicator patterns may correspond to different risk levels, which we evaluate next using the cancer labels.

Cluster 1 tends to have higher values on most indicators (especially X1, X5, X9…), indicating generally more abnormal readings, while Cluster 2 has lower/near-normal values. This suggests Cluster 1 might correspond to a sicker subgroup.

5. Outlier Detection

We now test the hypothesis: “Patients with highly abnormal health profiles (outliers) are more likely to have cancer.” We use three outlier-detection strategies: (a) Mahalanobis distance in the standardized original feature space, (b) kNN distance (local, density-based), and (c) DBSCAN noise labeling.

5.1 Mahalanobis Distance

#----------------------------------------------------------

# Mahalanobis distance based on scaled X

#----------------------------------------------------------
cov_mat <- cov(X_scaled)
center_vec <- colMeans(X_scaled)

md <- mahalanobis(X_scaled, center = center_vec, cov = cov_mat)

outlier_df <- tibble(
  md = md,
  y = y
)

ggplot(outlier_df, aes(x = md)) +
  geom_histogram(bins = 30) +
  labs(title = "Distribution of Mahalanobis Distances",
       x = "Mahalanobis Distance", y = "Count")

Mahalanobis distance measures how far each data point is from the multivariate center. Observations in the right tail correspond to unusually abnormal combinations of health indicators.
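As a reference point: stats::mahalanobis() returns squared distances, so if the scaled features were approximately multivariate normal, md would follow a chi-square distribution with 30 degrees of freedom. This gives a model-based alternative to the empirical quantile cutoff used below (a sketch, assuming approximate normality that these bounded indicators may not satisfy):

# Model-based cutoff: under multivariate normality, md ~ chi-square with df = ncol(X_scaled)
chisq_cutoff <- qchisq(0.95, df = ncol(X_scaled))
chisq_cutoff
mean(md > chisq_cutoff)  # fraction of patients beyond the theoretical cutoff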

# Choose a high quantile as cutoff (e.g., top 5% most extreme)

cutoff_md <- quantile(md, 0.95)

outlier_md_flag <- ifelse(md > cutoff_md, 1, 0)

table(outlier_md_flag)
## outlier_md_flag
##   0   1 
## 359  19
mean(outlier_md_flag)
## [1] 0.05026455

We use the top 5% as outliers to focus on extreme abnormality. This threshold can be adjusted based on the sensitivity Michigan Medicine wants to achieve.

Approximately 5% of patients are flagged as outliers. If outliers contain a higher proportion of cancer cases, this would support the hypothesis that cancer patients exhibit abnormal multivariate profiles.
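To make the threshold trade-off concrete, one could sweep the quantile cutoff and track how sensitivity changes. This sketch uses the label y, so it belongs to the evaluation phase, not to model building:

# Sensitivity at different Mahalanobis cutoffs (evaluation-only sketch)
cutoffs <- c(0.99, 0.95, 0.90, 0.80)
map_dfr(cutoffs, function(q) {
  flag <- as.integer(md > quantile(md, q))
  tibble(
    quantile    = q,
    flagged     = sum(flag),
    sensitivity = sum(flag == 1 & y == 1) / sum(y == 1)
  )
})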

5.2 kNN Distance (Density-Based)

#----------------------------------------------------------

# kNN distance for outlier detection

#----------------------------------------------------------
set.seed(123)
k <- 10
knn_dist <- kNNdist(X_scaled, k = k)  # distance to each point's k-th nearest neighbor

knn_df <- tibble(
  knn_dist = knn_dist,
  y = y
)

ggplot(knn_df, aes(x = knn_dist)) +
  geom_histogram(bins = 30) +
  labs(title = paste0("Distribution of ", k, "-NN Distances"),
       x = paste0(k, "-NN Distance"), y = "Count")

# Top 5% by kNN distance as outliers

cutoff_knn <- quantile(knn_dist, 0.95)
outlier_knn_flag <- ifelse(knn_dist > cutoff_knn, 1, 0)

table(outlier_knn_flag)
## outlier_knn_flag
##   0   1 
## 359  19
mean(outlier_knn_flag)
## [1] 0.05026455

For the global kNN outlier rule we use k = 10 neighbors, so each patient's score is the distance to its 10th-nearest neighbor; the same value serves as minPts for DBSCAN below.
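The choice of k controls how local the outlier score is. A quick stability check, sketched below, compares which patients are flagged under a few alternative values of k using the same top-5% rule:

# How stable are the flagged outliers across k? (sketch)
flags_by_k <- map(c(5, 10, 20), function(kk) {
  d <- kNNdist(X_scaled, k = kk)
  which(d > quantile(d, 0.95))
})
# Overlap between the k = 5 and k = 20 flag sets
length(intersect(flags_by_k[[1]], flags_by_k[[3]]))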

5.3 DBSCAN

#----------------------------------------------------------
# 5.3 DBSCAN Clustering
#----------------------------------------------------------

# 1. Use kNN distance plot to choose eps
k <- 10
kNNdistplot(X_scaled, k = k)
abline(h = quantile(kNNdist(X_scaled, k), 0.95), col = "red", lty = 2)

Interpretation: The “elbow” or sharp increase in the kNN-distance plot helps choose eps. Here, we test eps at the 95th percentile of the 10-NN distances.

# choose eps based on visual elbow or top quantile
eps_val <- quantile(kNNdist(X_scaled, k), 0.95)

db <- dbscan(X_scaled, eps = eps_val, minPts = k)

db
## DBSCAN clustering for 378 objects.
## Parameters: eps = 7.08410700906318, minPts = 10
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 1 cluster(s) and 10 noise points.
## 
##   0   1 
##  10 368 
## 
## Available fields: cluster, eps, minPts, metric, borderPoints

DBSCAN finds one main dense cluster and 10 noise points. Clinically, this means most patients look similar in the high-dimensional health space, with a small set of very unusual profiles flagged as noise.
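Before the formal evaluation in Section 6, a quick cross-tabulation can preview whether the noise points are enriched for cancer cases (again using y for evaluation only):

# Cross-tabulate DBSCAN noise status against the cancer label
table(noise = db$cluster == 0, cancer = y)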

Visualize DBSCAN Clusters in PCA space

db_df <- as_tibble(pc_scores[, 1:2]) %>%
  # columns are already named PC1 and PC2
  mutate(
    db_cluster = factor(db$cluster),
    y = y
  )

ggplot(db_df, aes(x = PC1, y = PC2, color = db_cluster)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "DBSCAN Clusters in PCA Space",
    color = "DBSCAN Cluster"
  )

The pink points (cluster 1) form the main group; the grey points (cluster 0) are DBSCAN noise, corresponding to potential high-risk anomalies.

6. Evaluation: Mapping Unsupervised Results to Cancer Detection

In this section, we evaluate the performance of each unsupervised method—clustering, Mahalanobis outliers, kNN outliers, and DBSCAN—using the ground-truth cancer labels only for validation.

Because cancer is a rare but high-cost condition, Sensitivity (True Positive Rate) is the most important metric: missing a cancer case (False Negative) is clinically unacceptable. Specificity and Precision are also reported, but are secondary since false alarms are less harmful than missed diagnoses.

For cancer detection: Positive class = Cancer (y = 1)

Define a helper function:

#----------------------------------------------------------

# Helper: Confusion matrix and metrics

#----------------------------------------------------------
calc_metrics <- function(truth, pred, positive = 1){
  
  truth <- as.integer(truth)
  pred  <- as.integer(pred)
  
  TP <- sum(truth == positive & pred == positive)
  TN <- sum(truth != positive & pred != positive)
  FP <- sum(truth != positive & pred == positive)
  FN <- sum(truth == positive & pred != positive)
  
  sensitivity <- ifelse((TP + FN) == 0, NA, TP / (TP + FN))  # recall / TPR
  specificity <- ifelse((TN + FP) == 0, NA, TN / (TN + FP))
  precision   <- ifelse((TP + FP) == 0, NA, TP / (TP + FP))
  accuracy    <- (TP + TN) / length(truth)
  
  tibble(
    TP = TP, FP = FP, FN = FN, TN = TN,
    Sensitivity = sensitivity,
    Specificity = specificity,
    Precision = precision,
    Accuracy = accuracy
  )
}

# DBSCAN rule: treat noise points (cluster 0) as predicted cancer (see Section 6.3)
pred_dbscan <- ifelse(db$cluster == 0, 1, 0)

metrics_dbscan <- calc_metrics(truth = y, pred = pred_dbscan)
metrics_dbscan
## # A tibble: 1 × 8
##      TP    FP    FN    TN Sensitivity Specificity Precision Accuracy
##   <int> <int> <int> <int>       <dbl>       <dbl>     <dbl>    <dbl>
## 1     7     3    14   354       0.333       0.992       0.7    0.955

6.1 Cluster-Based Cancer Rule

We define a simple rule: For each cluster, compute the cancer rate. Any cluster whose cancer rate is above the overall prevalence is labeled a “high-risk cluster”. Patients in high-risk clusters are classified as “cancer = 1”; others as “cancer = 0”.

#----------------------------------------------------------

# Evaluate clustering as a cancer detector

#----------------------------------------------------------
overall_prev <- mean(y)

cluster_eval_df <- cluster_df %>%
  group_by(cluster) %>%
  summarise(
    n = n(),
    cancer_rate = mean(y == 1),
    .groups = "drop"
  )

cluster_eval_df
## # A tibble: 2 × 3
##   cluster     n cancer_rate
##   <fct>   <int>       <dbl>
## 1 1          37      0.405 
## 2 2         341      0.0176
# High-risk clusters: cancer rate > overall prevalence

high_risk_clusters <- cluster_eval_df %>%
  filter(cancer_rate > overall_prev) %>%
  pull(cluster)

high_risk_clusters
## [1] 1
## Levels: 1 2
# Predicted cancer label based on cluster rule

pred_cluster <- ifelse(cluster_df$cluster %in% high_risk_clusters, 1, 0)

metrics_cluster <- calc_metrics(truth = y, pred = pred_cluster)
metrics_cluster
## # A tibble: 1 × 8
##      TP    FP    FN    TN Sensitivity Specificity Precision Accuracy
##   <int> <int> <int> <int>       <dbl>       <dbl>     <dbl>    <dbl>
## 1    15    22     6   335       0.714       0.938     0.405    0.926

6.2 Outlier-Based Cancer Rules

We now treat outliers as “predicted cancer”:

# Mahalanobis-based rule

pred_md <- outlier_md_flag  # 1 = outlier => predicted cancer

metrics_md <- calc_metrics(truth = y, pred = pred_md)
metrics_md
## # A tibble: 1 × 8
##      TP    FP    FN    TN Sensitivity Specificity Precision Accuracy
##   <int> <int> <int> <int>       <dbl>       <dbl>     <dbl>    <dbl>
## 1     9    10    12   347       0.429       0.972     0.474    0.942
# kNN-based rule

pred_knn <- outlier_knn_flag

metrics_knn <- calc_metrics(truth = y, pred = pred_knn)
metrics_knn
## # A tibble: 1 × 8
##      TP    FP    FN    TN Sensitivity Specificity Precision Accuracy
##   <int> <int> <int> <int>       <dbl>       <dbl>     <dbl>    <dbl>
## 1    10     9    11   348       0.476       0.975     0.526    0.947
# Compare all four methods
metrics_all <- bind_rows(
  metrics_cluster %>% mutate(Method = "Clustering on PCs"),
  metrics_md      %>% mutate(Method = "Mahalanobis Distance Outlier"),
  metrics_knn     %>% mutate(Method = paste0(k, "-NN Distance Outlier")),
  metrics_dbscan  %>% mutate(Method = "DBSCAN Outlier (Noise Points)")
) %>%
  select(Method, everything())

metrics_all
## # A tibble: 4 × 9
##   Method         TP    FP    FN    TN Sensitivity Specificity Precision Accuracy
##   <chr>       <int> <int> <int> <int>       <dbl>       <dbl>     <dbl>    <dbl>
## 1 Clustering…    15    22     6   335       0.714       0.938     0.405    0.926
## 2 Mahalanobi…     9    10    12   347       0.429       0.972     0.474    0.942
## 3 10-NN Dist…    10     9    11   348       0.476       0.975     0.526    0.947
## 4 DBSCAN Out…     7     3    14   354       0.333       0.992     0.7      0.955

6.3 Evaluate DBSCAN as a Cancer Detector

Idea: DBSCAN labels low-density points as noise → noise patients may be cancer cases.

These metrics were already computed above as metrics_dbscan and appear in the comparison table: the noise rule flags 7 of the 21 cancer cases (Sensitivity = 0.333, Specificity = 0.992, Precision = 0.7).

Interpretation: Sensitivity tells us what fraction of true cancer patients our unsupervised rule successfully flags. Specificity tells us what fraction of non-cancer patients we correctly classify as low risk.

  • Cluster-based rule: Sens = 0.714 (captures 15/21 cancers), Spec = 0.938.
  • Mahalanobis: Sens = 0.429 (9/21 cancers), Spec = 0.972; kNN: Sens = 0.476 (10/21 cancers), Spec = 0.975.
  • DBSCAN noise rule: Sens = 0.333 (7/21 cancers) but the highest specificity, 0.992.

Depending on the method and threshold, the trade-offs differ: in this data, the cluster-based rule achieves the highest sensitivity but generates the most false positives (22), meaning more patients would need follow-up tests, while the outlier-based rules are more conservative, trading sensitivity for specificity. Even the best rule still misses 6 of 21 cancer cases. Because cancer prevalence is low in this sample (~5.6%), a naive rule that labels everyone as “no cancer” would achieve ~94% accuracy but zero sensitivity, which is clinically unacceptable. Our unsupervised methods improve on that baseline by identifying a substantial share of cancer cases, though they still miss some.

  • Given Michigan Medicine’s priority of minimizing false negatives, the cluster-based rule is preferred among purely unsupervised options, even though it generates more false positives than the outlier-based rules.

7. Synthesis & Recommendation

7.1 Should Michigan Medicine Use This Data for Cancer Detection?

From our unsupervised analysis, the health indicators do contain signal related to cancer status:

  • Certain clusters and outliers are enriched for cancer cases.
  • Unsupervised rules achieve meaningful sensitivity relative to naive baselines.
  • No unsupervised method achieves near-perfect sensitivity. Some cancer patients still look “normal” in this data.

Suggestions:

  • Use this approach as a screening / triage tool, not as a standalone diagnostic.
  • It is suitable for flagging higher-risk patients for more intensive evaluation (e.g., imaging, specialist referral).
  • A negative result from these unsupervised rules must not be interpreted as “no cancer.”

Overall Recommendation:

Overall, we recommend adopting the cluster-based risk flag combined with Mahalanobis outlier checks as an early-warning signal, while continuing to rely on clinical judgment and confirmatory tests for diagnosis.

7.2 Best Unsupervised Workflow for Cancer Detection

Based on the analysis, a reasonable workflow for Michigan Medicine could be:

  1. Preprocess & Compress Data
  • Standardize all health indicators.
  • Use PCA to reduce to a small number of components capturing ~80% of variance (e.g., 5–7 PCs).
  2. Cluster Patients in PCA Space
  • Run k-means or similar clustering on the PC scores.
  • Identify clusters that historically have higher cancer prevalence (using labeled data in a development phase).
  • Label those clusters as higher-risk segments.
  3. Overlay Outlier Detection
  • Compute Mahalanobis or kNN distances.
  • Treat the most extreme outliers (top 5–10%) as very high risk, even if they fall outside known high-risk clusters.
  4. Risk Scoring
  • Combine cluster membership and outlier scores into a simple risk score (a sketch follows this list).
  • Base risk component from cluster.
  • Additional “red flag” if outlier distance is high.
  • Set thresholds to achieve desired sensitivity (e.g., >90%), accepting a higher false-positive rate.
  5. Clinical Integration
  • Use the risk score to prioritize patients for further testing.
  • Do not use it to cancel tests or deny care.
  • Continuously monitor performance (sensitivity, specificity, and calibration) as more labeled data accumulate.
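As a concrete illustration of step 4, the flags already computed in this report can be combined into an additive score. The weights and threshold below are hypothetical placeholders for exposition, not tuned values:

# Hypothetical combined risk score (sketch; weights and cutoff are illustrative)
risk_df <- tibble(
  in_high_risk_cluster = as.integer(cluster_df$cluster %in% high_risk_clusters),
  md_outlier  = outlier_md_flag,
  knn_outlier = outlier_knn_flag
) %>%
  mutate(risk_score = 2 * in_high_risk_cluster + md_outlier + knn_outlier)

# Flag anyone with score >= 2 (placeholder threshold) and evaluate
pred_risk <- as.integer(risk_df$risk_score >= 2)
calc_metrics(truth = y, pred = pred_risk)

In practice, the weights and cutoff would be set on a development sample to hit the target sensitivity before any clinical use.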

7.3 Limitations & Next Steps

  1. Label availability & representativeness: Our evaluation assumes the existing labeled sample is representative. If not, the performance estimates may not generalize to broader populations.

  2. Feature set: We only used the available indicators. Adding richer data (e.g., longitudinal trends, imaging summaries, genetic markers) could substantially improve unsupervised separation.

  3. Hybrid approaches: In practice, once labels are available at scale, a semi-supervised or supervised model (e.g., gradient boosting with regularization) will likely outperform pure unsupervised rules. The unsupervised approach remains useful for early-stage exploration and anomaly detection in large unlabeled populations.

  4. Next steps: Pilot the unsupervised risk score on a larger retrospective cohort and confirm that sensitivity is acceptable at operational thresholds. Integrate clinician feedback to tune the “actionable” risk level. Gradually transition to semi-supervised / supervised models as more labeled data become available.