library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(cluster) # clustering & silhouette
library(factoextra) # PCA & clustering visualization
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(dbscan) # kNN distance (for outliers)
##
## Attaching package: 'dbscan'
##
## The following object is masked from 'package:stats':
##
## as.dendrogram
Michigan Medicine is exploring whether routinely collected personal health indicators (e.g., lab results, vitals, risk scores) can be used as a screening tool to flag patients who may have cancer.
In this project, we:
- Treat the problem as unsupervised learning: we do not use the cancer outcome label y for model building, and use y only afterwards to evaluate how well the unsupervised methods identify cancer.
- Use three classes of methods:
  - Principal Component Analysis (PCA) to reduce dimensionality and denoise the predictors.
  - Clustering (on PCA scores) to see whether there are data-driven patient segments with different cancer risk.
  - Outlier detection methods to test the idea that "strongly abnormal health profiles" are more likely to correspond to cancer.
- Prioritize Sensitivity (True Positive Rate) in evaluation: because missing a cancer case is clinically much more costly than a false alarm, we are willing to tolerate more False Positives.
#----------------------------------------------------------
# 2.1 Load data
#----------------------------------------------------------
# Change file name if needed
raw <- read.csv("/Users/wenyuhan/Downloads/wbc.csv")
glimpse(raw)
## Rows: 378
## Columns: 31
## $ X1 <dbl> 0.31042643, 0.28865540, 0.11940934, 0.28628899, 0.05750390, 0.2399…
## $ X2 <dbl> 0.15725397, 0.20290835, 0.09232330, 0.29455529, 0.24112276, 0.1663…
## $ X3 <dbl> 0.30177597, 0.28912998, 0.11436666, 0.26826066, 0.05473015, 0.2366…
## $ X4 <dbl> 0.17934252, 0.15970308, 0.05531283, 0.16131495, 0.02477200, 0.1297…
## $ X5 <dbl> 0.4076916, 0.4953507, 0.4493094, 0.3358310, 0.3012549, 0.4556288, …
## $ X6 <dbl> 0.18989633, 0.33010245, 0.13968468, 0.05607018, 0.12284522, 0.2194…
## $ X7 <dbl> 0.156138707, 0.107029053, 0.069259606, 0.060028116, 0.037207123, 0…
## $ X8 <dbl> 0.23762425, 0.15457256, 0.10318091, 0.14527833, 0.02940855, 0.1366…
## $ X9 <dbl> 0.4166667, 0.4580808, 0.3813131, 0.2055556, 0.3580808, 0.3106061, …
## $ X10 <dbl> 0.1621735, 0.3822662, 0.4020640, 0.1826032, 0.3173968, 0.2205139, …
## $ X11 <dbl> 0.05736013, 0.02668839, 0.06003983, 0.02621764, 0.01622307, 0.0550…
## $ X12 <dbl> 0.09467822, 0.08563914, 0.13627122, 0.43798621, 0.13182903, 0.0815…
## $ X13 <dbl> 0.06130142, 0.02949630, 0.05428073, 0.01946002, 0.01587900, 0.0514…
## $ X14 <dbl> 0.031300080, 0.014695610, 0.016619412, 0.013743047, 0.002620480, 0…
## $ X15 <dbl> 0.22942516, 0.08104157, 0.26831424, 0.08971003, 0.24662610, 0.1283…
## $ X16 <dbl> 0.09273891, 0.12563463, 0.09063598, 0.01988013, 0.10670832, 0.0902…
## $ X17 <dbl> 0.060277778, 0.042878788, 0.050126263, 0.033914141, 0.040101010, 0…
## $ X18 <dbl> 0.2491002, 0.1229400, 0.2691798, 0.2204963, 0.1120856, 0.1072173, …
## $ X19 <dbl> 0.16769854, 0.12520403, 0.17431193, 0.26492936, 0.25128046, 0.0900…
## $ X20 <dbl> 0.04854691, 0.05286541, 0.07162500, 0.03047828, 0.05828946, 0.0527…
## $ X21 <dbl> 0.25542512, 0.23372465, 0.08182142, 0.19103522, 0.03678406, 0.2073…
## $ X22 <dbl> 0.19296375, 0.22574627, 0.09701493, 0.28757996, 0.26492537, 0.2313…
## $ X23 <dbl> 0.24548035, 0.22750137, 0.07331042, 0.16958016, 0.03411524, 0.1965…
## $ X24 <dbl> 0.12927645, 0.10944259, 0.03187672, 0.08865022, 0.01400904, 0.0976…
## $ X25 <dbl> 0.4809483, 0.3964208, 0.4043452, 0.1706399, 0.3865152, 0.5166083, …
## $ X26 <dbl> 0.14554045, 0.24285201, 0.08490264, 0.01833687, 0.10517992, 0.1826…
## $ X27 <dbl> 0.190894569, 0.150958466, 0.070822684, 0.038602236, 0.054952077, 0…
## $ X28 <dbl> 0.44261168, 0.25027491, 0.21398625, 0.17226804, 0.08810997, 0.2250…
## $ X29 <dbl> 0.27833629, 0.31914055, 0.17445299, 0.08318549, 0.30356791, 0.2329…
## $ X30 <dbl> 0.11511216, 0.17571822, 0.14882592, 0.04361800, 0.12495081, 0.1834…
## $ y <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
summary(raw$y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05556 0.00000 1.00000
table(raw$y)
##
## 0 1
## 357 21
prop.table(table(raw$y))
##
## 0 1
## 0.94444444 0.05555556
We have:
- Predictors: X1–X30 (continuous health indicators).
- Outcome: y (0 = no cancer, 1 = cancer), which will be used only for evaluation.
#----------------------------------------------------------
# 2.2 Separate predictors and outcome, scale predictors
#----------------------------------------------------------
# Outcome (kept aside for evaluation)
y <- raw$y
# Predictor matrix (all X1...X30)
X <- raw %>% select(starts_with("X"))
# Scale predictors (mean 0, sd 1) – important for PCA & distance-based methods
X_scaled <- scale(X)
# Basic EDA: distributions of selected variables
X %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 30) +
facet_wrap(~ variable, scales = "free") +
labs(title = "Distribution of Health Indicators",
x = "Value", y = "Count")
These indicators are roughly continuous and on different scales, which
justifies the use of scaling and PCA.
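As a quick sanity check (a minimal sketch using the objects defined above), we can confirm that scaling produced mean-0, sd-1 columns:
# After scale(), every column should have mean ~0 and sd = 1
round(colMeans(X_scaled)[1:5], 10) # means numerically zero
apply(X_scaled[, 1:5], 2, sd) # sds equal to 1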
#----------------------------------------------------------
# 3.1 Fit PCA on scaled predictors
#----------------------------------------------------------
pca_fit <- prcomp(X_scaled, center = FALSE, scale. = FALSE) # data already centered & scaled above
# Proportion of variance explained
pve <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
cumulative_pve <- cumsum(pve)
pve_tbl <- tibble(
PC = 1:length(pve),
PVE = pve,
CumPVE = cumulative_pve
)
pve_tbl %>% head(10)
## # A tibble: 10 × 3
## PC PVE CumPVE
## <int> <dbl> <dbl>
## 1 1 0.376 0.376
## 2 2 0.226 0.602
## 3 3 0.0964 0.699
## 4 4 0.0733 0.772
## 5 5 0.0678 0.840
## 6 6 0.0402 0.880
## 7 7 0.0248 0.905
## 8 8 0.0160 0.921
## 9 9 0.0146 0.936
## 10 10 0.0137 0.949
# Scree plot
ggplot(pve_tbl, aes(x = PC, y = PVE)) +
geom_line() +
geom_point() +
labs(title = "Scree Plot: Variance Explained by Each Principal Component",
x = "Principal Component", y = "Proportion of Variance Explained")
ggplot(pve_tbl, aes(x = PC, y = CumPVE)) +
geom_line() +
geom_point() +
geom_hline(yintercept = 0.8, linetype = "dashed", color = "red") +
annotate("text", x = 5, y = 0.82, label = "80% variance", color = "red") +
labs(title = "Cumulative Variance Explained by Principal Components",
x = "Number of PCs", y = "Cumulative PVE")
The scree plot shows a steep drop after the first few components,
meaning most variability is captured in the first 4–6 PCs. This
indicates the original features are highly correlated and can be
summarized with fewer dimensions.
The first 5 PCs explain about 84% of the variance (crossing the 80% target), so reducing the data to 5 components preserves most of the information while reducing noise.
#----------------------------------------------------------
# 3.2 Select number of PCs that explain ~80% of the variance
#----------------------------------------------------------
num_pcs <- min(which(cumulative_pve >= 0.80))
num_pcs
## [1] 5
# PCA scores for selected components
pc_scores <- pca_fit$x[, 1:num_pcs]
head(pc_scores)
## PC1 PC2 PC3 PC4 PC5
## [1,] -1.0599829 0.9740446 -1.2644291 -1.2958504 -0.36126075
## [2,] -0.4345808 0.3184465 -2.3336944 -0.7409305 0.43554949
## [3,] 2.4491849 -2.3563354 -0.7760504 -1.4233094 -1.37218377
## [4,] 2.6410221 1.6209466 1.5580247 0.5806082 -0.73798578
## [5,] 4.2039359 -2.5407498 -0.5769904 -0.4292987 0.28307268
## [6,] 0.2909962 0.3204955 -1.9265758 -0.4806329 0.06541547
loadings <- pca_fit$rotation[, 1:2] # PC1 and PC2 loadings
round(loadings, 3)[1:20, ] # show first 20 variables for readability
## PC1 PC2
## X1 -0.183 0.287
## X2 -0.032 0.029
## X3 -0.199 0.270
## X4 -0.196 0.277
## X5 -0.121 -0.162
## X6 -0.259 -0.126
## X7 -0.270 -0.086
## X8 -0.277 0.023
## X9 -0.120 -0.159
## X10 -0.110 -0.316
## X11 -0.193 0.002
## X12 0.011 -0.126
## X13 -0.211 -0.001
## X14 -0.225 0.101
## X15 0.006 -0.265
## X16 -0.192 -0.207
## X17 -0.165 -0.191
## X18 -0.194 -0.160
## X19 -0.017 -0.196
## X20 -0.130 -0.254
Interpretation: A relatively small number of synthetic dimensions (principal components) capture most of the variation across patients. This suggests we can work with a compressed representation of patient health rather than all 30 raw indicators, which simplifies clustering and outlier detection while reducing noise.
PC1 has negative and similarly sized loadings across almost all variables (e.g., X6–X8, X13, X14). Because the loadings share one sign, PC1 represents a general abnormality or overall health deviation dimension: patients at one extreme of PC1 tend to have many health indicators deviating from the population center at the same time (with these negative loadings, jointly elevated indicators produce strongly negative PC1 scores).
PC2 shows a clear split between two sets of variables: X1, X3, and X4 have strong positive loadings, while X10, X15, X16, X17, X19, and X20 have strong negative loadings. This suggests that PC2 captures a contrast between two different patterns of abnormality: one driven by variables such as X1–X4, and another driven by X10, X15, X20, and related markers.
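To make this contrast visible, a correlation-circle plot of the loadings can help. This is a sketch using factoextra (already loaded); select.var restricts the display to the 10 variables contributing most to PC1/PC2:
# Variable loadings on PC1/PC2; arrows pointing opposite ways along PC2
# reflect the contrast described above
fviz_pca_var(pca_fit,
             axes = c(1, 2),
             select.var = list(contrib = 10), # top 10 contributors
             repel = TRUE)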
We cluster in the reduced PC space (pc_scores).
#----------------------------------------------------------
# 4.1 Choose k using within-cluster sum of squares (Elbow)
#----------------------------------------------------------
set.seed(123)
wss <- map_dbl(2:6, function(k){
kmeans(pc_scores, centers = k, nstart = 20)$tot.withinss
})
elbow_tbl <- tibble(
k = 2:6,
wss = wss
)
ggplot(elbow_tbl, aes(x = k, y = wss)) +
geom_line() +
geom_point() +
labs(title = "Elbow Plot for K-Means Clustering",
x = "Number of Clusters (k)",
y = "Total Within-Cluster Sum of Squares")
#----------------------------------------------------------
# 4.2 Silhouette analysis for k = 2,...,6
#----------------------------------------------------------
sil_scores <- map_dbl(2:6, function(k){
km <- kmeans(pc_scores, centers = k, nstart = 20)
ss <- silhouette(km$cluster, dist(pc_scores))
mean(ss[, "sil_width"])
})
sil_tbl <- tibble(
k = 2:6,
avg_sil = sil_scores
)
ggplot(sil_tbl, aes(x = k, y = avg_sil)) +
geom_line() +
geom_point() +
labs(title = "Average Silhouette Width vs. Number of Clusters",
x = "Number of Clusters (k)", y = "Average Silhouette Width")
# pick k maximizing silhouette
best_k <- sil_tbl$k[which.max(sil_tbl$avg_sil)]
best_k
## [1] 2
Average silhouette is maximized at k = 2 and drops sharply afterwards, so we choose two clusters.
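Beyond the average width, the per-patient silhouette profile for k = 2 can confirm that both clusters are cohesive. A sketch using cluster::silhouette and factoextra::fviz_silhouette (both packages already loaded):
set.seed(123)
km2 <- kmeans(pc_scores, centers = 2, nstart = 20)
sil2 <- silhouette(km2$cluster, dist(pc_scores))
fviz_silhouette(sil2) # negative widths would flag poorly assigned patients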
km_final <- kmeans(pc_scores, centers = best_k, nstart = 50)
clusters <- km_final$cluster
cluster_df <- as_tibble(pc_scores) %>%
mutate(
cluster = factor(clusters),
y = y
)
# 2D visualization with first two PCs
ggplot(cluster_df, aes(x = PC1, y = PC2, color = cluster)) +
geom_point(alpha = 0.7) +
labs(
title = paste("Patient Clusters in PCA Space (k =", best_k, ")"),
x = "PC1", y = "PC2", color = "Cluster"
)
#----------------------------------------------------------
# 4.3 Cluster profiles on original variables
#----------------------------------------------------------
cluster_profiles <- cluster_df %>%
bind_cols(as_tibble(X)) %>%
group_by(cluster) %>%
summarise(across(starts_with("X"), mean), .groups = "drop")
cluster_profiles
## # A tibble: 2 × 31
## cluster X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.329 0.328 0.337 0.211 0.473 0.470 0.423 0.376 0.506 0.495 0.145
## 2 2 0.247 0.282 0.239 0.137 0.356 0.172 0.0931 0.122 0.336 0.252 0.0599
## # ℹ 19 more variables: X12 <dbl>, X13 <dbl>, X14 <dbl>, X15 <dbl>, X16 <dbl>,
## # X17 <dbl>, X18 <dbl>, X19 <dbl>, X20 <dbl>, X21 <dbl>, X22 <dbl>,
## # X23 <dbl>, X24 <dbl>, X25 <dbl>, X26 <dbl>, X27 <dbl>, X28 <dbl>,
## # X29 <dbl>, X30 <dbl>
# Visualize cluster profiles as a heatmap
cluster_profiles_long <- cluster_profiles %>%
pivot_longer(cols = starts_with("X"), names_to = "variable", values_to = "mean_val")
ggplot(cluster_profiles_long, aes(x = variable, y = cluster, fill = mean_val)) +
geom_tile() +
coord_flip() +
labs(title = "Cluster Profiles on Original Health Indicators",
x = "Health Indicator", y = "Cluster") +
scale_fill_viridis_c()
Interpretation: The algorithm finds a small number of patient segments
that share similar health profiles. Differences across clusters in
specific indicator patterns may correspond to different risk levels,
which we evaluate next using the cancer labels.
Cluster 1 tends to have higher values on most indicators (especially X1, X5, X9…), indicating generally more abnormal readings, while Cluster 2 has lower/near-normal values. This suggests Cluster 1 might correspond to a sicker subgroup.
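To quantify "higher on most indicators," we can compare cluster means on the standardized scale, where positive values mean above the overall average. A minimal sketch using the scaled matrix and cluster assignments defined above:
# Mean of each scaled indicator by cluster, largest separations first
as_tibble(X_scaled) %>%
  mutate(cluster = factor(clusters)) %>%
  group_by(cluster) %>%
  summarise(across(starts_with("X"), mean)) %>%
  pivot_longer(-cluster, names_to = "variable", values_to = "z_mean") %>%
  arrange(desc(abs(z_mean)))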
We now test the hypothesis: "patients with highly abnormal health profiles (outliers) are more likely to have cancer." We use two outlier detection strategies:
a. Mahalanobis distance in the standardized original feature space.
b. kNN distance (local density-based).
#----------------------------------------------------------
# Mahalanobis distance based on scaled X
#----------------------------------------------------------
cov_mat <- cov(X_scaled)
center_vec <- colMeans(X_scaled)
md <- mahalanobis(X_scaled, center = center_vec, cov = cov_mat)
outlier_df <- tibble(
md = md,
y = y
)
ggplot(outlier_df, aes(x = md)) +
geom_histogram(bins = 30) +
labs(title = "Distribution of Mahalanobis Distances",
x = "Mahalanobis Distance", y = "Count")
Mahalanobis distance measures how far each data point is from the
multivariate center. Observations in the right tail correspond to
unusually abnormal combinations of health indicators.
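Note that stats::mahalanobis() returns squared distances, so under an approximate multivariate-normal assumption a theoretical cutoff is also available from the chi-squared distribution with p = 30 degrees of freedom (a sketch; the analysis below uses the empirical 95% quantile instead):
# Theoretical 95% cutoff: squared MD ~ chi-squared with df = ncol(X)
cutoff_chisq <- qchisq(0.95, df = ncol(X_scaled))
cutoff_chisq
sum(md > cutoff_chisq) # patients flagged under the chi-squared rule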
# Choose a high quantile as cutoff (e.g., top 5% most extreme)
cutoff_md <- quantile(md, 0.95)
outlier_md_flag <- ifelse(md > cutoff_md, 1, 0)
table(outlier_md_flag)
## outlier_md_flag
## 0 1
## 359 19
mean(outlier_md_flag)
## [1] 0.05026455
We use the top 5% as outliers to focus on extreme abnormality. This threshold can be adjusted based on the sensitivity Michigan Medicine wants to achieve.
Approximately 5% of patients are flagged as outliers. If outliers contain a higher proportion of cancer cases, this would support the hypothesis that cancer patients exhibit abnormal multivariate profiles.
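A direct check of that hypothesis, as a one-line sketch using the flag just defined:
# Cancer rate outside vs. inside the flagged group; a much higher rate
# among flagged patients supports the abnormal-profile hypothesis
tapply(y, outlier_md_flag, mean)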
#----------------------------------------------------------
# kNN distance for outlier detection
#----------------------------------------------------------
set.seed(123)
k <- 10
knn_dist <- kNNdist(X_scaled, k = k) # distance to each point's k-th nearest neighbor
knn_df <- tibble(
knn_dist = knn_dist,
y = y
)
ggplot(knn_df, aes(x = knn_dist)) +
geom_histogram(bins = 30) +
labs(title = paste0("Distribution of ", k, "-NN Distances"),
x = paste0(k, "-NN Distance"), y = "Count")
# Top 5% by kNN distance as outliers
cutoff_knn <- quantile(knn_dist, 0.95)
outlier_knn_flag <- ifelse(knn_dist > cutoff_knn, 1, 0)
table(outlier_knn_flag)
## outlier_knn_flag
## 0 1
## 359 19
mean(outlier_knn_flag)
## [1] 0.05026455
For global kNN outliers we use k = 10 neighbors (the same value later used as minPts in DBSCAN), so each point's distance reflects a reasonably dense neighborhood.
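The choice k = 10 is a tuning decision. A quick robustness sketch is to recompute the top-5% flag under a smaller neighborhood and cross-tabulate:
# Heavy diagonal mass means the flagged set is stable to the choice of k
knn_dist5 <- kNNdist(X_scaled, k = 5)
flag5 <- as.integer(knn_dist5 > quantile(knn_dist5, 0.95))
table(k5_flag = flag5, k10_flag = outlier_knn_flag)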
#----------------------------------------------------------
# 5.3 DBSCAN Clustering
#----------------------------------------------------------
# 1. Use kNN distance plot to choose eps
k <- 10
kNNdistplot(X_scaled, k = k)
abline(h = quantile(kNNdist(X_scaled, k), 0.95), col = "red", lty = 2)
Interpretation: The “elbow” or sharp increase in the kNN-distance plot helps choose eps. Here, we test eps at the 95th percentile of the 10-NN distances.
# choose eps based on visual elbow or top quantile
eps_val <- quantile(kNNdist(X_scaled, k), 0.95)
db <- dbscan(X_scaled, eps = eps_val, minPts = k)
db
## DBSCAN clustering for 378 objects.
## Parameters: eps = 7.08410700906318, minPts = 10
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 1 cluster(s) and 10 noise points.
##
## 0 1
## 10 368
##
## Available fields: cluster, eps, minPts, metric, borderPoints
DBSCAN finds one main dense cluster and 10 noise points. Clinically, this means most patients look similar in the high-dimensional health space, with a small set of very unusual profiles (the noise points).
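How much do the outlier notions agree? A quick cross-check (sketch, using the flags computed earlier):
# Agreement between DBSCAN noise and the two distance-based flags
dbscan_flag <- as.integer(db$cluster == 0)
table(DBSCAN_noise = dbscan_flag, MD_outlier = outlier_md_flag)
table(DBSCAN_noise = dbscan_flag, kNN_outlier = outlier_knn_flag)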
Visualize DBSCAN Clusters in PCA space
db_df <- as_tibble(pc_scores[, 1:2]) %>%
# columns are already named PC1 and PC2
mutate(
db_cluster = factor(db$cluster),
y = y
)
ggplot(db_df, aes(x = PC1, y = PC2, color = db_cluster)) +
geom_point(alpha = 0.7) +
labs(
title = "DBSCAN Clusters in PCA Space",
color = "DBSCAN Cluster"
)
Points in the main cluster (cluster 1) form the dense central group; points labeled cluster 0 are DBSCAN noise, corresponding to potential high-risk anomalies.
In this section, we evaluate each unsupervised method (clustering, Mahalanobis outliers, kNN outliers, and DBSCAN) using the ground-truth cancer labels only for validation.
Because cancer is a rare but high-cost condition, Sensitivity (True Positive Rate) is the most important metric: missing a cancer case (False Negative) is clinically unacceptable. Specificity and Precision are also reported, but are secondary since false alarms are less harmful than missed diagnoses.
For cancer detection, the positive class is Cancer (y = 1).
Define a helper function:
#----------------------------------------------------------
# Helper: Confusion matrix and metrics
#----------------------------------------------------------
calc_metrics <- function(truth, pred, positive = 1){
truth <- as.integer(truth)
pred <- as.integer(pred)
TP <- sum(truth == positive & pred == positive)
TN <- sum(truth != positive & pred != positive)
FP <- sum(truth != positive & pred == positive)
FN <- sum(truth == positive & pred != positive)
sensitivity <- ifelse((TP + FN) == 0, NA, TP / (TP + FN)) # recall / TPR
specificity <- ifelse((TN + FP) == 0, NA, TN / (TN + FP))
precision <- ifelse((TP + FP) == 0, NA, TP / (TP + FP))
accuracy <- (TP + TN) / length(truth)
tibble(
TP = TP, FP = FP, FN = FN, TN = TN,
Sensitivity = sensitivity,
Specificity = specificity,
Precision = precision,
Accuracy = accuracy
)
}
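A quick sanity check of the helper on a toy vector (each cell of the confusion matrix should equal 1, so all four metrics are 0.5):
calc_metrics(truth = c(1, 0, 1, 0), pred = c(1, 1, 0, 0))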
# DBSCAN rule: noise points (cluster 0) are low-density profiles, treated as predicted cancer
pred_dbscan <- ifelse(db$cluster == 0, 1, 0)
metrics_dbscan <- calc_metrics(truth = y, pred = pred_dbscan)
metrics_dbscan
## # A tibble: 1 × 8
## TP FP FN TN Sensitivity Specificity Precision Accuracy
## <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 7 3 14 354 0.333 0.992 0.7 0.955
We define a simple rule:
- For each cluster, compute the cancer rate.
- Any cluster whose cancer rate is above the overall prevalence is labeled a "high-risk cluster".
- Patients in high-risk clusters are classified as cancer = 1; others as cancer = 0.
#----------------------------------------------------------
# Evaluate clustering as a cancer detector
#----------------------------------------------------------
overall_prev <- mean(y)
cluster_eval_df <- cluster_df %>%
group_by(cluster) %>%
summarise(
n = n(),
cancer_rate = mean(y == 1),
.groups = "drop"
)
cluster_eval_df
## # A tibble: 2 × 3
## cluster n cancer_rate
## <fct> <int> <dbl>
## 1 1 37 0.405
## 2 2 341 0.0176
# High-risk clusters: cancer rate > overall prevalence
high_risk_clusters <- cluster_eval_df %>%
filter(cancer_rate > overall_prev) %>%
pull(cluster)
high_risk_clusters
## [1] 1
## Levels: 1 2
# Predicted cancer label based on cluster rule
pred_cluster <- ifelse(cluster_df$cluster %in% high_risk_clusters, 1, 0)
metrics_cluster <- calc_metrics(truth = y, pred = pred_cluster)
metrics_cluster
## # A tibble: 1 × 8
## TP FP FN TN Sensitivity Specificity Precision Accuracy
## <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 15 22 6 335 0.714 0.938 0.405 0.926
We now treat outliers as “predicted cancer”:
# Mahalanobis-based rule
pred_md <- outlier_md_flag # 1 = outlier => predicted cancer
metrics_md <- calc_metrics(truth = y, pred = pred_md)
metrics_md
## # A tibble: 1 × 8
## TP FP FN TN Sensitivity Specificity Precision Accuracy
## <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 9 10 12 347 0.429 0.972 0.474 0.942
# kNN-based rule
pred_knn <- outlier_knn_flag
metrics_knn <- calc_metrics(truth = y, pred = pred_knn)
metrics_knn
## # A tibble: 1 × 8
## TP FP FN TN Sensitivity Specificity Precision Accuracy
## <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 10 9 11 348 0.476 0.975 0.526 0.947
# Compare all methods in one table
metrics_all <- bind_rows(
  metrics_cluster %>% mutate(Method = "Clustering on PCs"),
  metrics_md %>% mutate(Method = "Mahalanobis Distance Outlier"),
  metrics_knn %>% mutate(Method = paste0(k, "-NN Distance Outlier")),
  metrics_dbscan %>% mutate(Method = "DBSCAN Outlier (Noise Points)")
) %>%
  select(Method, everything())
metrics_all
metrics_all
## # A tibble: 4 × 9
## Method TP FP FN TN Sensitivity Specificity Precision Accuracy
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Clustering… 15 22 6 335 0.714 0.938 0.405 0.926
## 2 Mahalanobi… 9 10 12 347 0.429 0.972 0.474 0.942
## 3 10-NN Dist… 10 9 11 348 0.476 0.975 0.526 0.947
## 4 DBSCAN Out… 7 3 14 354 0.333 0.992 0.7 0.955
Interpretation: Sensitivity tells us what fraction of true cancer patients our unsupervised rule successfully flags. Specificity tells us what fraction of non-cancer patients we correctly classify as low risk.
Depending on the method and threshold:
- The cluster-based rule achieves the highest sensitivity here (0.714) but also the most false positives (22), meaning more patients would need follow-up tests.
- The distance-based outlier rules (Mahalanobis, kNN, DBSCAN noise) are more specific but miss more than half of the cancer cases at the 5% cutoff.
- Because cancer prevalence is low in this sample (~5.6%), a naive rule that labels everyone as "no cancer" would have high accuracy but zero sensitivity, which is clinically unacceptable. Our unsupervised methods improve on that by identifying a non-trivial share of cancer cases, though they still miss some.
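To see the sensitivity / false-positive trade-off concretely, one can sweep the Mahalanobis cutoff. A sketch reusing calc_metrics; the cutoff quantiles are illustrative:
# Sensitivity rises and false positives accumulate as the cutoff relaxes
map_dfr(c(0.99, 0.95, 0.90, 0.80), function(q){
  flag <- as.integer(md > quantile(md, q))
  calc_metrics(truth = y, pred = flag) %>%
    mutate(cutoff_quantile = q) %>%
    select(cutoff_quantile, Sensitivity, FP)
})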
From our unsupervised analysis, the health indicators do contain signal related to cancer status: both the high-risk cluster (cancer rate 0.405 vs. 0.056 overall) and the outlier flags capture cancer cases at rates well above prevalence.
Suggestions:
- Use this approach as a screening / triage tool, not as a standalone diagnostic.
- It is suitable for flagging higher-risk patients for more intensive evaluation (e.g., imaging, specialist referral).
- A negative result from these unsupervised rules must not be interpreted as "no cancer."
Overall Recommendation:
Overall, we recommend adopting the cluster-based risk flag combined with Mahalanobis outlier checks as an early-warning signal, while continuing to rely on clinical judgment and confirmatory tests for diagnosis.
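A minimal sketch of what such a combined early-warning flag could look like, using the predictions already defined (pred_cluster and outlier_md_flag):
# Flag a patient if EITHER the high-risk-cluster rule OR the
# Mahalanobis outlier rule fires; the union favors sensitivity
pred_combined <- as.integer(pred_cluster == 1 | outlier_md_flag == 1)
calc_metrics(truth = y, pred = pred_combined)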
Based on the analysis, a reasonable workflow for Michigan Medicine could be:
a. Preprocess & compress data: standardize all health indicators, then use PCA to reduce to a small number of components capturing ~80% of variance (e.g., 5–7 PCs).
b. Flag risk: assign each patient to a cluster in PC space and flag membership in the high-risk (high cancer rate) cluster, cross-checked against Mahalanobis outlier status.
c. Triage: route flagged patients to confirmatory testing and clinical review rather than treating the flag as a diagnosis.
Label availability & representativeness: Our evaluation assumes the existing labeled sample is representative. If not, the performance estimates may not generalize to broader populations.
Feature set: We only used the available indicators. Adding richer data (e.g., longitudinal trends, imaging summaries, genetic markers) could substantially improve unsupervised separation.
Hybrid approaches: In practice, once labels are available at scale, a semi-supervised or supervised model (e.g., gradient boosting with regularization) will likely outperform pure unsupervised rules. The unsupervised approach remains useful for early-stage exploration and anomaly detection in large unlabeled populations.
Next steps:
- Pilot the unsupervised risk score on a larger retrospective cohort and confirm that sensitivity is acceptable at operational thresholds.
- Integrate clinician feedback to tune the "actionable" risk level.
- Gradually transition to semi-supervised / supervised models as more labeled data become available.