Στόχος: Πρόβλεψη κακοήθειας από κυτταρολογικά χαρακτηριστικά
Σαν data scientists σε ένα ογκολογικό κέντρο και έχουμε αναλάβει την ανάλυση 699 βιοψιών. Στόχος μας είναι η δημιουργία ενός υποστηρικτικού μοντέλου που θα προβλέπει αν ένας όγκος είναι καλοήθης (benign) ή κακοήθης (malignant), δίνοντας έμφαση στην ερμηνευσιμότητα και την υψηλή ευαισθησία.
# --- Φόρτωση πακέτων ---
library(tidyverse)
library(mlbench)
library(randomForest)
library(xgboost)
library(caret)
library(pROC)
library(kableExtra)
set.seed(42)
# --- Φόρτωση δεδομένων ---
data("BreastCancer", package = "mlbench")
bc <- BreastCancer
# Καθαρισμός: αφαίρεση ID, χειρισμός missing, μετατροπή σε numeric
bc$Id <- NULL
bc <- na.omit(bc) # Αφαίρεση γραμμών με NAs
bc[, 1:9] <- lapply(bc[, 1:9], function(x) as.numeric(as.character(x)))
# Έλεγχος δομής και κατανομής της target μεταβλητής
str(bc)
## 'data.frame': 683 obs. of 10 variables:
## $ Cl.thickness : num 5 5 3 6 4 8 1 2 2 4 ...
## $ Cell.size : num 1 4 1 8 1 10 1 1 1 2 ...
## $ Cell.shape : num 1 4 1 8 1 10 1 2 1 1 ...
## $ Marg.adhesion : num 1 5 1 1 3 8 1 1 1 1 ...
## $ Epith.c.size : num 2 7 2 3 2 7 2 2 2 2 ...
## $ Bare.nuclei : num 1 10 2 4 1 10 10 1 1 1 ...
## $ Bl.cromatin : num 3 3 3 3 3 9 3 3 1 2 ...
## $ Normal.nucleoli: num 1 2 1 7 1 7 1 1 1 1 ...
## $ Mitoses : num 1 1 1 1 1 1 1 1 5 1 ...
## $ Class : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
## - attr(*, "na.action")= 'omit' Named int [1:16] 24 41 140 146 159 165 236 250 276 293 ...
## ..- attr(*, "names")= chr [1:16] "24" "41" "140" "146" ...
table(bc$Class)
##
## benign malignant
## 444 239
Το dataset μας έχει μέτρια ασυμμετρία στις κλάσεις, οπότε θα χρησιμοποιήσουμε stratified split για να διατηρήσουμε την αναλογία.
Θα χωρίσουμε τα δεδομένα με βάση την κατανομή της target μεταβλητής.
bc_clean <- na.omit(bc)
train_idx <- createDataPartition(bc$Class, p = 0.7, list = FALSE)
train <- bc[train_idx, ]
test <- bc[-train_idx, ]
# Έλεγχος αναλογιών
prop.table(table(train$Class))
##
## benign malignant
## 0.6492693 0.3507307
prop.table(table(test$Class))
##
## benign malignant
## 0.6519608 0.3480392
Στόχος: μοντέλο που προβλέπει το Class
rf_model <- randomForest(
Class ~ .,
data = train,
ntree = 500,
importance = TRUE
)
print(rf_model)
##
## Call:
## randomForest(formula = Class ~ ., data = train, ntree = 500, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 3.34%
## Confusion matrix:
## benign malignant class.error
## benign 301 10 0.03215434
## malignant 6 162 0.03571429
# Προβλέψεις (κλάσεις)
rf_pred <- predict(rf_model, test)
# Πιθανότητες (για το AUC χρειαζόμαστε την πιθανότητα της positive κλάσης)
# Θεωρούμε positive το "malignant" (κακοήθης)
rf_prob <- predict(rf_model, test, type = "prob")[, "malignant"]
# Confusion Matrix
rf_cm <- confusionMatrix(rf_pred, test$Class, positive = "malignant")
print(rf_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction benign malignant
## benign 131 2
## malignant 2 69
##
## Accuracy : 0.9804
## 95% CI : (0.9506, 0.9946)
## No Information Rate : 0.652
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9568
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9718
## Specificity : 0.9850
## Pos Pred Value : 0.9718
## Neg Pred Value : 0.9850
## Prevalence : 0.3480
## Detection Rate : 0.3382
## Detection Prevalence : 0.3480
## Balanced Accuracy : 0.9784
##
## 'Positive' Class : malignant
##
# AUC
rf_auc <- roc(test$Class, rf_prob, levels = c("benign", "malignant"))$auc
cat("RF AUC:", round(rf_auc, 3), "\n")
## RF AUC: 0.998
Εξετάζουμε ποια χαρακτηριστικά δίνουν τη μεγαλύτερη πληροφορία στο μοντέλο.
varImpPlot(rf_model, main = "Variable Importance - Random Forest")
importance(rf_model)
## benign malignant MeanDecreaseAccuracy MeanDecreaseGini
## Cl.thickness 13.435500 21.454365 19.592274 12.492423
## Cell.size 11.976588 15.948203 19.900875 58.301150
## Cell.shape 10.011263 21.083813 22.370249 53.470968
## Marg.adhesion 4.951799 13.487285 13.528026 6.915326
## Epith.c.size 10.025874 4.762791 11.224042 12.617603
## Bare.nuclei 19.069891 20.454103 24.973998 36.838280
## Bl.cromatin 5.943408 14.599084 16.079432 13.093756
## Normal.nucleoli 11.431501 11.709221 15.365021 22.570651
## Mitoses 4.732594 1.147448 4.710676 1.322268
Bare.nuclei ,
Cell.shape και Cell.size.Το XGBoost δεν δέχεται data.frames απευθείας, οπότε φτιάχνουμε αριθμητικά matrices.
# Αφαιρούμε την 10η στήλη (Class) για τα features
train_x <- data.matrix(train[, -which(names(train) == "Class")])
# Μετατρέπουμε το target σε 0/1 (malignant = 1)
train_y <- ifelse(train$Class == "malignant", 1, 0)
test_x <- data.matrix(test[, -which(names(test) == "Class")])
test_y <- ifelse(test$Class == "malignant", 1, 0)
# Δημιουργία xgb.DMatrix
dtrain <- xgb.DMatrix(data = train_x, label = train_y)
dtest <- xgb.DMatrix(data = test_x, label = test_y)
Ορίζουμε early stopping ώστε το μοντέλο να σταματήσει όταν το test AUC δεν βελτιώνεται για 20 γύρους.
set.seed(42)
params <- list(
objective = "binary:logistic",
eval_metric = "auc",
max_depth = 4,
eta = 0.1,
subsample = 0.8,
colsample_bytree = 0.8
)
xgb_model <- xgb.train(
params = params,
data = dtrain,
nrounds = 500,
watchlist = list(train = dtrain, test = dtest),
early_stopping_rounds = 20,
verbose = 1,
)
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 20 rounds.
##
## [1] train-auc:0.977109 test-auc:0.967489
## [2] train-auc:0.977109 test-auc:0.983798
## [3] train-auc:0.990985 test-auc:0.994334
## [4] train-auc:0.993052 test-auc:0.993858
## [5] train-auc:0.992880 test-auc:0.992905
## [6] train-auc:0.993961 test-auc:0.994493
## [7] train-auc:0.994028 test-auc:0.994334
## [8] train-auc:0.994134 test-auc:0.994652
## [9] train-auc:0.994985 test-auc:0.995817
## [10] train-auc:0.995043 test-auc:0.995288
## [11] train-auc:0.994890 test-auc:0.994864
## [12] train-auc:0.995100 test-auc:0.995076
## [13] train-auc:0.995732 test-auc:0.994546
## [14] train-auc:0.995885 test-auc:0.994970
## [15] train-auc:0.996210 test-auc:0.995076
## [16] train-auc:0.996057 test-auc:0.995182
## [17] train-auc:0.996325 test-auc:0.995340
## [18] train-auc:0.996517 test-auc:0.995340
## [19] train-auc:0.996746 test-auc:0.995658
## [20] train-auc:0.996880 test-auc:0.995552
## [21] train-auc:0.997148 test-auc:0.995552
## [22] train-auc:0.997186 test-auc:0.995446
## [23] train-auc:0.997263 test-auc:0.995552
## [24] train-auc:0.997378 test-auc:0.995446
## [25] train-auc:0.997742 test-auc:0.995764
## [26] train-auc:0.997722 test-auc:0.995446
## [27] train-auc:0.998009 test-auc:0.995764
## [28] train-auc:0.998239 test-auc:0.995870
## [29] train-auc:0.998335 test-auc:0.995976
## [30] train-auc:0.998431 test-auc:0.995976
## [31] train-auc:0.998469 test-auc:0.996188
## [32] train-auc:0.998660 test-auc:0.995976
## [33] train-auc:0.998565 test-auc:0.995976
## [34] train-auc:0.998641 test-auc:0.995976
## [35] train-auc:0.998756 test-auc:0.996188
## [36] train-auc:0.998890 test-auc:0.995976
## [37] train-auc:0.998909 test-auc:0.995870
## [38] train-auc:0.998947 test-auc:0.995764
## [39] train-auc:0.999005 test-auc:0.995764
## [40] train-auc:0.999120 test-auc:0.995870
## [41] train-auc:0.999196 test-auc:0.996294
## [42] train-auc:0.999215 test-auc:0.996505
## [43] train-auc:0.999292 test-auc:0.996505
## [44] train-auc:0.999273 test-auc:0.996717
## [45] train-auc:0.999311 test-auc:0.996823
## [46] train-auc:0.999426 test-auc:0.996823
## [47] train-auc:0.999445 test-auc:0.997141
## [48] train-auc:0.999445 test-auc:0.997353
## [49] train-auc:0.999445 test-auc:0.997564
## [50] train-auc:0.999502 test-auc:0.997458
## [51] train-auc:0.999541 test-auc:0.997458
## [52] train-auc:0.999560 test-auc:0.997564
## [53] train-auc:0.999541 test-auc:0.997353
## [54] train-auc:0.999598 test-auc:0.997670
## [55] train-auc:0.999598 test-auc:0.997670
## [56] train-auc:0.999655 test-auc:0.997458
## [57] train-auc:0.999655 test-auc:0.997353
## [58] train-auc:0.999655 test-auc:0.997458
## [59] train-auc:0.999675 test-auc:0.997564
## [60] train-auc:0.999675 test-auc:0.997458
## [61] train-auc:0.999694 test-auc:0.997670
## [62] train-auc:0.999675 test-auc:0.997353
## [63] train-auc:0.999713 test-auc:0.997353
## [64] train-auc:0.999770 test-auc:0.997353
## [65] train-auc:0.999751 test-auc:0.997353
## [66] train-auc:0.999789 test-auc:0.997353
## [67] train-auc:0.999789 test-auc:0.997353
## [68] train-auc:0.999789 test-auc:0.997141
## [69] train-auc:0.999789 test-auc:0.997035
## [70] train-auc:0.999789 test-auc:0.997353
## [71] train-auc:0.999789 test-auc:0.997353
## [72] train-auc:0.999809 test-auc:0.997458
## [73] train-auc:0.999809 test-auc:0.997564
## [74] train-auc:0.999809 test-auc:0.997776
## [75] train-auc:0.999847 test-auc:0.997882
## [76] train-auc:0.999847 test-auc:0.997564
## [77] train-auc:0.999847 test-auc:0.997564
## [78] train-auc:0.999866 test-auc:0.997776
## [79] train-auc:0.999847 test-auc:0.997670
## [80] train-auc:0.999847 test-auc:0.997670
## [81] train-auc:0.999866 test-auc:0.997776
## [82] train-auc:0.999866 test-auc:0.997882
## [83] train-auc:0.999866 test-auc:0.997670
## [84] train-auc:0.999885 test-auc:0.997670
## [85] train-auc:0.999866 test-auc:0.997670
## [86] train-auc:0.999866 test-auc:0.997670
## [87] train-auc:0.999866 test-auc:0.997564
## [88] train-auc:0.999866 test-auc:0.997564
## [89] train-auc:0.999866 test-auc:0.997670
## [90] train-auc:0.999885 test-auc:0.997564
## [91] train-auc:0.999904 test-auc:0.997458
## [92] train-auc:0.999923 test-auc:0.997458
## [93] train-auc:0.999923 test-auc:0.997458
## [94] train-auc:0.999943 test-auc:0.997564
## Stopping. Best iteration:
## [95] train-auc:0.999923 test-auc:0.997458
##
## [95] train-auc:0.999923 test-auc:0.997458
Αρχικά αξιολογούμε το XGBoost.
xgb_prob <- predict(xgb_model, dtest)
xgb_pred <- factor(ifelse(xgb_prob > 0.5, "malignant", "benign"),
levels = c("benign", "malignant"))
xgb_cm <- confusionMatrix(xgb_pred, test$Class, positive = "malignant")
xgb_auc <- roc(test$Class, xgb_prob, levels = c("benign", "malignant"))$auc
# Δημιουργία πίνακα σύγκρισης (results dataframe)
results <- data.frame(
Model = c("Random Forest", "XGBoost"),
Accuracy = c(rf_cm$overall["Accuracy"], xgb_cm$overall["Accuracy"]),
Sensitivity = c(rf_cm$byClass["Sensitivity"], xgb_cm$byClass["Sensitivity"]),
Specificity = c(rf_cm$byClass["Specificity"], xgb_cm$byClass["Specificity"]),
AUC = c(rf_auc, xgb_auc)
) %>% mutate(across(where(is.numeric), ~ round(.x, 4)))
| Model | Accuracy | Sensitivity | Specificity | AUC |
|---|---|---|---|---|
| Random Forest | 0.9804 | 0.9718 | 0.985 | 0.9982 |
| XGBoost | 0.9755 | 0.9577 | 0.985 | 0.9979 |
Θα δοκιμάσουμε ένα πολύ μικρό (0.01) και ένα μεγάλο (0.3) learning rate.
set.seed(42)
# eta = 0.01
params_slow <- list(
objective = "binary:logistic",
eval_metric = "auc",
max_depth = 4,
eta = 0.01
)
xgb_slow <- xgb.train(
params = params_slow,
data = dtrain,
nrounds = 500,
watchlist = list(train=dtrain, test=dtest),
early_stopping_rounds = 20,
verbose = 1
)
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 20 rounds.
##
## [1] train-auc:0.989148 test-auc:0.987557
## [2] train-auc:0.989263 test-auc:0.988457
## [3] train-auc:0.989272 test-auc:0.988775
## [4] train-auc:0.989311 test-auc:0.988351
## [5] train-auc:0.989311 test-auc:0.988775
## [6] train-auc:0.989406 test-auc:0.988775
## [7] train-auc:0.989406 test-auc:0.988351
## [8] train-auc:0.989406 test-auc:0.988351
## [9] train-auc:0.989406 test-auc:0.988351
## [10] train-auc:0.989406 test-auc:0.988351
## [11] train-auc:0.989406 test-auc:0.988775
## [12] train-auc:0.989406 test-auc:0.988351
## [13] train-auc:0.989406 test-auc:0.988775
## [14] train-auc:0.989406 test-auc:0.988775
## [15] train-auc:0.989406 test-auc:0.988351
## [16] train-auc:0.989406 test-auc:0.988351
## [17] train-auc:0.989406 test-auc:0.988775
## [18] train-auc:0.991866 test-auc:0.989145
## [19] train-auc:0.991904 test-auc:0.988722
## [20] train-auc:0.991981 test-auc:0.989040
## [21] train-auc:0.991981 test-auc:0.988616
## [22] train-auc:0.992134 test-auc:0.988616
## [23] train-auc:0.992019 test-auc:0.989040
## [24] train-auc:0.992019 test-auc:0.988828
## [25] train-auc:0.992134 test-auc:0.988616
## [26] train-auc:0.992134 test-auc:0.988616
## [27] train-auc:0.992134 test-auc:0.989040
## [28] train-auc:0.992134 test-auc:0.988616
## [29] train-auc:0.992134 test-auc:0.988510
## [30] train-auc:0.992497 test-auc:0.988510
## [31] train-auc:0.992497 test-auc:0.988934
## [32] train-auc:0.992497 test-auc:0.988510
## [33] train-auc:0.992497 test-auc:0.988934
## [34] train-auc:0.992622 test-auc:0.988934
## [35] train-auc:0.993502 test-auc:0.994970
## [36] train-auc:0.993617 test-auc:0.994440
## [37] train-auc:0.993674 test-auc:0.994440
## [38] train-auc:0.993674 test-auc:0.994440
## [39] train-auc:0.993713 test-auc:0.994440
## [40] train-auc:0.993789 test-auc:0.994440
## [41] train-auc:0.993847 test-auc:0.994017
## [42] train-auc:0.993770 test-auc:0.994123
## [43] train-auc:0.993961 test-auc:0.994440
## [44] train-auc:0.994019 test-auc:0.994546
## [45] train-auc:0.994057 test-auc:0.994440
## [46] train-auc:0.994000 test-auc:0.994017
## [47] train-auc:0.993981 test-auc:0.994017
## [48] train-auc:0.994000 test-auc:0.994070
## [49] train-auc:0.994268 test-auc:0.994123
## [50] train-auc:0.994268 test-auc:0.994123
## [51] train-auc:0.994344 test-auc:0.994546
## [52] train-auc:0.994517 test-auc:0.994546
## [53] train-auc:0.994631 test-auc:0.994758
## [54] train-auc:0.994631 test-auc:0.994546
## Stopping. Best iteration:
## [55] train-auc:0.994651 test-auc:0.994811
##
## [55] train-auc:0.994651 test-auc:0.994811
# eta = 0.3
params_fast <- list(
objective = "binary:logistic",
eval_metric = "auc",
max_depth = 4,
eta = 0.3
)
xgb_fast <- xgb.train(
params = params_fast,
data = dtrain,
nrounds = 500,
watchlist = list(train=dtrain, test=dtest),
early_stopping_rounds = 20,
verbose = 1
)
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 20 rounds.
##
## [1] train-auc:0.989148 test-auc:0.987557
## [2] train-auc:0.992737 test-auc:0.992852
## [3] train-auc:0.994000 test-auc:0.994917
## [4] train-auc:0.995885 test-auc:0.994440
## [5] train-auc:0.996459 test-auc:0.994546
## [6] train-auc:0.997273 test-auc:0.994864
## [7] train-auc:0.998076 test-auc:0.995711
## [8] train-auc:0.998593 test-auc:0.996135
## [9] train-auc:0.998919 test-auc:0.995658
## [10] train-auc:0.999349 test-auc:0.995552
## [11] train-auc:0.999330 test-auc:0.995870
## [12] train-auc:0.999598 test-auc:0.995870
## [13] train-auc:0.999732 test-auc:0.995976
## [14] train-auc:0.999732 test-auc:0.996294
## [15] train-auc:0.999732 test-auc:0.996188
## [16] train-auc:0.999713 test-auc:0.996717
## [17] train-auc:0.999847 test-auc:0.996929
## [18] train-auc:0.999828 test-auc:0.997035
## [19] train-auc:0.999866 test-auc:0.997035
## [20] train-auc:0.999904 test-auc:0.997141
## [21] train-auc:0.999923 test-auc:0.996929
## [22] train-auc:0.999943 test-auc:0.996929
## [23] train-auc:0.999943 test-auc:0.996823
## [24] train-auc:0.999943 test-auc:0.997035
## [25] train-auc:0.999943 test-auc:0.997035
## [26] train-auc:0.999981 test-auc:0.997141
## [27] train-auc:0.999981 test-auc:0.997141
## [28] train-auc:0.999981 test-auc:0.997141
## [29] train-auc:0.999981 test-auc:0.997141
## [30] train-auc:0.999981 test-auc:0.997247
## [31] train-auc:1.000000 test-auc:0.997141
## [32] train-auc:1.000000 test-auc:0.997247
## [33] train-auc:1.000000 test-auc:0.997141
## [34] train-auc:1.000000 test-auc:0.997141
## [35] train-auc:1.000000 test-auc:0.997247
## [36] train-auc:1.000000 test-auc:0.997141
## [37] train-auc:1.000000 test-auc:0.997141
## [38] train-auc:1.000000 test-auc:0.997035
## [39] train-auc:1.000000 test-auc:0.997035
## [40] train-auc:1.000000 test-auc:0.997035
## [41] train-auc:1.000000 test-auc:0.997035
## [42] train-auc:1.000000 test-auc:0.996929
## [43] train-auc:1.000000 test-auc:0.996929
## [44] train-auc:1.000000 test-auc:0.996929
## [45] train-auc:1.000000 test-auc:0.996929
## [46] train-auc:1.000000 test-auc:0.996929
## [47] train-auc:1.000000 test-auc:0.996823
## [48] train-auc:1.000000 test-auc:0.996823
## [49] train-auc:1.000000 test-auc:0.996611
## Stopping. Best iteration:
## [50] train-auc:1.000000 test-auc:0.996611
##
## [50] train-auc:1.000000 test-auc:0.996611
Παρατηρείται ότι χρειάστηκε πολύ πιο λίγα βήματα περίπου 50-55
Όσο πιο κοντά στην πάνω-αριστερή γωνία, τόσο καλύτερα.
roc_rf <- roc(test$Class, rf_prob, levels = c("benign", "malignant"))
roc_xgb <- roc(test$Class, xgb_prob, levels = c("benign", "malignant"))
plot(roc_rf, col = "forestgreen", lwd = 2, main = "ROC Curves: RF vs XGBoost")
lines(roc_xgb, col = "steelblue", lwd = 2)
legend("bottomright",
legend = c(paste0("RF (AUC = ", round(rf_auc, 3), ")"),
paste0("XGB (AUC = ", round(xgb_auc, 3), ")")),
col = c("forestgreen", "steelblue"),
lwd = 2)
eta καθορίζει το πόσο “εμπιστεύεται” κάθε νέο δέντρο. Με
eta = 0.01 (πιο αργή εκμάθηση), το μοντέλο χρειάζεται 55
γύρους πριν γίνει το early stopping. Αντίθετα, με
eta = 0.3, “μαθαίνει” πιο γρήγορα και το early stopping
ενεργοποιείται νωρίτερα (50 γύρους), ίσως όμως χάνοντας λίγο σε
γενίκευση.