============================================================
LAB 010 — Random Forests & Gradient Boosting
Dataset: Breast Cancer (Wisconsin)
Goal: Predict malignancy from cytological features
============================================================
# Packages attached for this lab (startup messages omitted)
library(tidyverse)
library(mlbench)
library(randomForest)
library(xgboost)
library(caret)
library(pROC)
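# Assumed load step (not shown in the original): the BreastCancer data from mlbench
data(BreastCancer, package = "mlbench")
bc <- BreastCancer
bc$Id <- NULL      # drop the ID column
bc <- na.omit(bc)  # drop records with missing values
# Convert the nine predictors from factors to numeric
bc[, 1:9] <- lapply(bc[, 1:9], function(x) as.numeric(as.character(x)))
str(bc)
## 'data.frame': 683 obs. of 10 variables: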
## $ Cl.thickness : num 5 5 3 6 4 8 1 2 2 4 ...
## $ Cell.size : num 1 4 1 8 1 10 1 1 1 2 ...
## $ Cell.shape : num 1 4 1 8 1 10 1 2 1 1 ...
## $ Marg.adhesion : num 1 5 1 1 3 8 1 1 1 1 ...
## $ Epith.c.size : num 2 7 2 3 2 7 2 2 2 2 ...
## $ Bare.nuclei : num 1 10 2 4 1 10 10 1 1 1 ...
## $ Bl.cromatin : num 3 3 3 3 3 9 3 3 1 2 ...
## $ Normal.nucleoli: num 1 2 1 7 1 7 1 1 1 1 ...
## $ Mitoses : num 1 1 1 1 1 1 1 1 5 1 ...
## $ Class : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
## - attr(*, "na.action")= 'omit' Named int [1:16] 24 41 140 146 159 165 236 250 276 293 ...
## ..- attr(*, "names")= chr [1:16] "24" "41" "140" "146" ...
table(bc$Class)
##
## benign malignant
## 444 239
============================================================
============================================================
(we keep the same class proportions in both sets)
Hint: createDataPartition() from caret
idx <- createDataPartition(bc$Class, p = 0.7, list = FALSE)
train <- bc[idx, ]
test <- bc[-idx, ]
# Check that the class proportions were preserved (stratified split)
prop.table(table(train$Class)) %>% round(3)
##
## benign malignant
## 0.649 0.351
##
## benign malignant
## 0.652 0.348
##
## Train set: 479 observations
## Test set: 204 observations
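The idea behind a stratified split can be sketched in base R: sample 70% of the row indices within each class separately, so that both sets inherit the class mix. This is a minimal illustration and not caret's actual implementation; the helper name `stratified_idx` and the toy label vector are assumptions made for the example (only the class counts 444 / 239 come from the output above).

```r
# Hypothetical helper (not part of caret): stratified row sampling.
stratified_idx <- function(y, p = 0.7) {
  by_class <- split(seq_along(y), y)   # row indices grouped by class
  picked <- lapply(by_class, function(ids) {
    # sample.int avoids the sample(x) pitfall when a class has a single row
    ids[sample.int(length(ids), size = round(p * length(ids)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(42)
# Toy labels with the same class counts as the cleaned dataset (444 / 239)
y <- factor(rep(c("benign", "malignant"), times = c(444, 239)))
idx <- stratified_idx(y, p = 0.7)

round(prop.table(table(y[idx])), 3)    # class mix in the training part
round(prop.table(table(y[-idx])), 3)   # class mix in the held-out part
```

Because the sampling happens per class, the two proportion tables can differ only through rounding of the per-class sample sizes.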
Goal: a model that predicts Class
set.seed(42)
rf_model <- randomForest(
  Class ~ .,          # all features
  data = train,
  ntree = 500,        # number of trees
  importance = TRUE   # needed for variable importance
)
print(rf_model)
##
## Call:
## randomForest(formula = Class ~ ., data = train, ntree = 500, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 3.76%
## Confusion matrix:
## benign malignant class.error
## benign 301 10 0.03215434
## malignant 8 160 0.04761905
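The OOB error rate and the per-class errors in this printout follow directly from the confusion-matrix counts. A small check in base R, using only the numbers shown above (the variable names are illustrative):

```r
# OOB confusion matrix from the printout (rows = true class, cols = predicted)
oob <- matrix(c(301,  10,
                  8, 160),
              nrow = 2, byrow = TRUE,
              dimnames = list(true = c("benign", "malignant"),
                              pred = c("benign", "malignant")))

oob_error   <- 1 - sum(diag(oob)) / sum(oob)   # share of misclassified rows
class_error <- 1 - diag(oob) / rowSums(oob)    # per-class error rate

round(100 * oob_error, 2)
round(class_error, 4)

# randomForest's default mtry for classification is floor(sqrt(p));
# with p = 9 predictors this matches "variables tried at each split"
floor(sqrt(9))
```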
Hint: confusionMatrix() + roc()
# Predictions
rf_pred_class <- predict(rf_model, newdata = test, type = "response")
rf_pred_prob <- predict(rf_model, newdata = test, type = "prob")[, "malignant"]
# Confusion Matrix
cm_rf <- confusionMatrix(rf_pred_class, test$Class, positive = "malignant")
print(cm_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction benign malignant
## benign 131 2
## malignant 2 69
##
## Accuracy : 0.9804
## 95% CI : (0.9506, 0.9946)
## No Information Rate : 0.652
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9568
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9718
## Specificity : 0.9850
## Pos Pred Value : 0.9718
## Neg Pred Value : 0.9850
## Prevalence : 0.3480
## Detection Rate : 0.3382
## Detection Prevalence : 0.3480
## Balanced Accuracy : 0.9784
##
## 'Positive' Class : malignant
##
## Setting direction: controls < cases
##
## AUC (Random Forest): 0.9985
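The statistics reported by confusionMatrix() can be reproduced by hand from the four cell counts. A minimal sketch in base R, with malignant as the positive class and the counts copied from the matrix above (the variable names are illustrative):

```r
# Counts from the test-set confusion matrix (positive class = malignant)
TP <- 69; FN <- 2; FP <- 2; TN <- 131

metrics <- c(
  Accuracy    = (TP + TN) / (TP + TN + FP + FN),
  Sensitivity = TP / (TP + FN),   # malignant cases correctly flagged
  Specificity = TN / (TN + FP),   # benign cases correctly cleared
  PPV         = TP / (TP + FP),   # Pos Pred Value
  NPV         = TN / (TN + FN)    # Neg Pred Value
)
metrics["Balanced Accuracy"] <- mean(metrics[c("Sensitivity", "Specificity")])
round(metrics, 4)
```

The two false negatives are the clinically costly errors here: they are malignant cases that the model labels benign.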
rf_results <- tibble(
  Metric = c("Accuracy", "Sensitivity", "Specificity", "AUC"),
  Value = c(
    round(cm_rf$overall["Accuracy"], 4),
    round(cm_rf$byClass["Sensitivity"], 4),
    round(cm_rf$byClass["Specificity"], 4),
    round(auc_rf, 4)
  )
)
knitr::kable(rf_results, caption = "Random Forest results on the test set")

| Metric | Value |
|---|---|
| Accuracy | 0.9804 |
| Sensitivity | 0.9718 |
| Specificity | 0.9850 |
| AUC | 0.9985 |
# Plot Variable Importance
varImpPlot(
rf_model,
main = "Variable Importance — Random Forest",
col = "#8F00FF",
pch = 19
)
## Accuracy of Random Forest: 98.04 %
imp <- importance(rf_model)
imp_df <- data.frame(
Feature = rownames(imp),
MeanDecreaseAccuracy = imp[, "MeanDecreaseAccuracy"],
MeanDecreaseGini = imp[, "MeanDecreaseGini"]
) %>% arrange(desc(MeanDecreaseGini))
cat("Top-3 Features:\n")
## Top-3 Features:
## Feature MeanDecreaseAccuracy MeanDecreaseGini
## Cell.size Cell.size 19.82619 60.34435
## Cell.shape Cell.shape 21.80028 53.56115
## Bare.nuclei Bare.nuclei 26.03557 34.15482
Cell.size (Uniformity of Cell Size): the most important variable by MeanDecreaseGini
Cell.shape (Uniformity of Cell Shape): closely related to Cell.size
Bare.nuclei: the presence of bare nuclei is a strong indicator of malignancy (it ranks first by MeanDecreaseAccuracy)
These findings make medical sense, since non-uniform cells and bare nuclei are classic cytological signs of malignancy.
Not quite: this question has no single answer, and that is the critical point.
In a medical diagnostic context, Accuracy alone is misleading. What matters is:
With Sensitivity of about 97% and AUC above 0.99, the model is excellent as a decision-support tool, but it does not replace the clinical pathologist. In practice, the classification threshold would be tuned to maximize Sensitivity, accepting some loss of Specificity.
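The threshold trade-off can be illustrated with a few made-up predicted probabilities. The sketch below is not based on the lab's predictions: the `truth` and `prob` vectors and the helper `sens_spec` are assumptions chosen only to show how lowering the cutoff raises Sensitivity and lowers Specificity.

```r
# Synthetic example (not the lab's predictions): 1 = malignant, 0 = benign
truth <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
prob  <- c(0.95, 0.90, 0.70, 0.45, 0.30, 0.40, 0.20, 0.15, 0.10, 0.08, 0.05, 0.02)

# Hypothetical helper: sensitivity and specificity at a given cutoff
sens_spec <- function(truth, prob, cutoff) {
  pred <- as.integer(prob > cutoff)
  c(cutoff      = cutoff,
    sensitivity = sum(pred == 1 & truth == 1) / sum(truth == 1),
    specificity = sum(pred == 0 & truth == 0) / sum(truth == 0))
}

# Lowering the cutoff catches more malignant cases at the cost of false alarms
t(sapply(c(0.5, 0.35, 0.25), function(k) sens_spec(truth, prob, k)))
```

On real data the cutoff would be chosen on a validation set rather than on the test set, so that the reported test metrics stay unbiased.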
============================================================
============================================================
Hint: as.matrix() + ifelse() for the target
# Convert to a numeric matrix
train_x <- as.matrix(train[, 1:9])
train_y <- ifelse(train$Class == "malignant", 1, 0)
test_x <- as.matrix(test[, 1:9])
test_y <- ifelse(test$Class == "malignant", 1, 0)
# Build xgb.DMatrix objects (XGBoost's optimized data format)
dtrain <- xgb.DMatrix(data = train_x, label = train_y)
dtest <- xgb.DMatrix(data = test_x, label = test_y)
Parameters: max_depth=4, eta=0.1, nrounds=500, early_stopping_rounds=20
set.seed(42)
params <- list(
objective = "binary:logistic",
eval_metric = "auc",
max_depth = 4,
eta = 0.1, # learning rate
subsample = 0.8,
colsample_bytree = 0.8
)
xgb_model <- xgb.train(
params = params,
data = dtrain,
nrounds = 500,
evals = list(train = dtrain, test = dtest), # 'watchlist' has been renamed to 'evals'
early_stopping_rounds = 20,
print_every_n = 25,
verbose = 1
)
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 20 rounds.
##
## [1] train-auc:0.977109 test-auc:0.967489
## [26] train-auc:0.997722 test-auc:0.995446
## [51] train-auc:0.999541 test-auc:0.997458
## [76] train-auc:0.999847 test-auc:0.997564
## Stopping. Best iteration:
## [95] train-auc:0.999923 test-auc:0.997458
##
## [95] train-auc:0.999923 test-auc:0.997458
# XGBoost predictions
xgb_prob <- predict(xgb_model, dtest)
xgb_pred <- factor(ifelse(xgb_prob > 0.5, "malignant", "benign"),
levels = c("benign", "malignant"))
xgb_cm <- confusionMatrix(xgb_pred, test$Class, positive = "malignant")
print(xgb_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction benign malignant
## benign 131 3
## malignant 2 68
##
## Accuracy : 0.9755
## 95% CI : (0.9437, 0.992)
## No Information Rate : 0.652
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9458
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9577
## Specificity : 0.9850
## Pos Pred Value : 0.9714
## Neg Pred Value : 0.9776
## Prevalence : 0.3480
## Detection Rate : 0.3333
## Detection Prevalence : 0.3431
## Balanced Accuracy : 0.9714
##
## 'Positive' Class : malignant
##
## Setting direction: controls < cases
## XGBoost AUC: 0.998
Columns: Accuracy, Sensitivity, Specificity, AUC
comparison <- tibble(
  Model = c("Random Forest", "XGBoost"),
  Accuracy = c(
    round(cm_rf$overall["Accuracy"], 4),
    round(xgb_cm$overall["Accuracy"], 4)
  ),
  Sensitivity = c(
    round(cm_rf$byClass["Sensitivity"], 4),
    round(xgb_cm$byClass["Sensitivity"], 4)
  ),
  Specificity = c(
    round(cm_rf$byClass["Specificity"], 4),
    round(xgb_cm$byClass["Specificity"], 4)
  ),
  AUC = c(round(auc_rf, 4), round(xgb_auc, 4))
)
knitr::kable(comparison, caption = "Random Forest vs XGBoost")

| Model | Accuracy | Sensitivity | Specificity | AUC |
|---|---|---|---|---|
| Random Forest | 0.9804 | 0.9718 | 0.985 | 0.9985 |
| XGBoost | 0.9755 | 0.9577 | 0.985 | 0.9979 |
## Feature Gain Cover Frequency
## <char> <num> <num> <num>
## 1: Cell.shape 0.41314930 0.17212968 0.10379242
## 2: Cell.size 0.26421488 0.14343242 0.09580838
## 3: Bare.nuclei 0.12418836 0.20984672 0.17964072
## 4: Cl.thickness 0.04996820 0.10978109 0.13772455
## 5: Normal.nucleoli 0.04993893 0.11101276 0.13972056
## 6: Bl.cromatin 0.04290336 0.10546435 0.08782435
## 7: Marg.adhesion 0.03863542 0.09280645 0.16566866
## 8: Epith.c.size 0.01144765 0.03618452 0.06986028
## 9: Mitoses 0.00555390 0.01934201 0.01996008
# Timing (RF grows 500 trees while XGBoost runs 100 rounds, so the comparison is only indicative)
time_rf <- system.time(randomForest(Class ~ ., data = train, ntree = 500))
time_xgb <- system.time(xgb.train(params, dtrain, nrounds = 100))
data.frame(
Model = c("Random Forest", "XGBoost"),
Seconds = c(time_rf[3], time_xgb[3])
)
## Model Seconds
## 1 Random Forest 0.08
## 2 XGBoost 0.28
Both models agree on the same top-3 features (Cell.size, Cell.shape, Bare.nuclei), despite their different logic (bagging vs boosting). This works as a sanity check: when two different algorithms point to the same features, confidence grows that those features are genuinely informative and not a random artifact. The small difference in ranking order is expected, since RF ranks by MeanDecreaseGini and XGBoost by Gain.
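The agreement between the two rankings can be checked mechanically. A minimal sketch in base R, with the feature orderings copied from the two importance tables in this lab (the vector names are illustrative):

```r
# Feature rankings copied from the two importance tables in this lab
rf_top  <- c("Cell.size", "Cell.shape", "Bare.nuclei")   # by MeanDecreaseGini
xgb_all <- c("Cell.shape", "Cell.size", "Bare.nuclei", "Cl.thickness",
             "Normal.nucleoli", "Bl.cromatin", "Marg.adhesion",
             "Epith.c.size", "Mitoses")                   # by Gain

xgb_top <- head(xgb_all, 3)

setequal(rf_top, xgb_top)   # same set of top-3 features?
match(rf_top, xgb_all)      # XGBoost rank of each RF top-3 feature
```

The set comparison ignores order, while `match()` shows where the order differs: the first two features swap places between the two models.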
run_xgb <- function(eta_val) {
params_eta <- list(
objective = "binary:logistic",
eval_metric = "auc",
max_depth = 4,
eta = eta_val,
subsample = 0.8,
colsample_bytree = 0.8
)
m <- xgb.train(
params = params_eta,
data = dtrain,
nrounds = 500,
evals = list(train = dtrain, test = dtest),
early_stopping_rounds = 20,
verbose = 1
)
list(
eta = eta_val,
best_iter = m$best_iteration,
best_auc = round(as.numeric(m$best_score), 4),
model = m
)
}
res_001 <- run_xgb(0.01)
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 20 rounds.
##
## [1] train-auc:0.983444 test-auc:0.983321
## [2] train-auc:0.985894 test-auc:0.986392
## [3] train-auc:0.990181 test-auc:0.994758
## [4] train-auc:0.990067 test-auc:0.994758
## [5] train-auc:0.990650 test-auc:0.994493
## [6] train-auc:0.990526 test-auc:0.995076
## [7] train-auc:0.991473 test-auc:0.995870
## [8] train-auc:0.992086 test-auc:0.996082
## [9] train-auc:0.991990 test-auc:0.996294
## [10] train-auc:0.991894 test-auc:0.996188
## [11] train-auc:0.992172 test-auc:0.996452
## [12] train-auc:0.992038 test-auc:0.996135
## [13] train-auc:0.992153 test-auc:0.996399
## [14] train-auc:0.992076 test-auc:0.995976
## [15] train-auc:0.992153 test-auc:0.996082
## [16] train-auc:0.992095 test-auc:0.996082
## [17] train-auc:0.992172 test-auc:0.996188
## [18] train-auc:0.992229 test-auc:0.996188
## [19] train-auc:0.992296 test-auc:0.995976
## [20] train-auc:0.992277 test-auc:0.996082
## [21] train-auc:0.992976 test-auc:0.995976
## [22] train-auc:0.992976 test-auc:0.996082
## [23] train-auc:0.992909 test-auc:0.996082
## [24] train-auc:0.992804 test-auc:0.995870
## [25] train-auc:0.992842 test-auc:0.995976
## [26] train-auc:0.992861 test-auc:0.995976
## [27] train-auc:0.993493 test-auc:0.995764
## [28] train-auc:0.993531 test-auc:0.995764
## [29] train-auc:0.993646 test-auc:0.995764
## [30] train-auc:0.993550 test-auc:0.995870
## Stopping. Best iteration:
## [31] train-auc:0.993627 test-auc:0.995976
##
## [31] train-auc:0.993627 test-auc:0.995976
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 20 rounds.
##
## [1] train-auc:0.975568 test-auc:0.973472
## [2] train-auc:0.982602 test-auc:0.980779
## [3] train-auc:0.989062 test-auc:0.992005
## [4] train-auc:0.993052 test-auc:0.992322
## [5] train-auc:0.995493 test-auc:0.994070
## [6] train-auc:0.996086 test-auc:0.993752
## [7] train-auc:0.996076 test-auc:0.993540
## [8] train-auc:0.995962 test-auc:0.993434
## [9] train-auc:0.995875 test-auc:0.993646
## [10] train-auc:0.995761 test-auc:0.993434
## [11] train-auc:0.996067 test-auc:0.992905
## [12] train-auc:0.996067 test-auc:0.993805
## [13] train-auc:0.996153 test-auc:0.994123
## [14] train-auc:0.996459 test-auc:0.994123
## [15] train-auc:0.996373 test-auc:0.994864
## [16] train-auc:0.996622 test-auc:0.994758
## [17] train-auc:0.996718 test-auc:0.995446
## [18] train-auc:0.996765 test-auc:0.995764
## [19] train-auc:0.996919 test-auc:0.995764
## [20] train-auc:0.997206 test-auc:0.995658
## [21] train-auc:0.997167 test-auc:0.995976
## [22] train-auc:0.997512 test-auc:0.996294
## [23] train-auc:0.997550 test-auc:0.995870
## [24] train-auc:0.997703 test-auc:0.995976
## [25] train-auc:0.997799 test-auc:0.996294
## [26] train-auc:0.997990 test-auc:0.996082
## [27] train-auc:0.998124 test-auc:0.995976
## [28] train-auc:0.998124 test-auc:0.995976
## [29] train-auc:0.998297 test-auc:0.995976
## [30] train-auc:0.998526 test-auc:0.995764
## [31] train-auc:0.998603 test-auc:0.995976
## [32] train-auc:0.998603 test-auc:0.996188
## [33] train-auc:0.998832 test-auc:0.995976
## [34] train-auc:0.998852 test-auc:0.996294
## [35] train-auc:0.998832 test-auc:0.995870
## [36] train-auc:0.998890 test-auc:0.995870
## [37] train-auc:0.998871 test-auc:0.995976
## [38] train-auc:0.998909 test-auc:0.996082
## [39] train-auc:0.999081 test-auc:0.996188
## [40] train-auc:0.999215 test-auc:0.996188
## [41] train-auc:0.999273 test-auc:0.996399
## [42] train-auc:0.999273 test-auc:0.996611
## [43] train-auc:0.999196 test-auc:0.996294
## [44] train-auc:0.999254 test-auc:0.996399
## [45] train-auc:0.999311 test-auc:0.996399
## [46] train-auc:0.999388 test-auc:0.996399
## [47] train-auc:0.999407 test-auc:0.996188
## [48] train-auc:0.999464 test-auc:0.996505
## [49] train-auc:0.999464 test-auc:0.996717
## [50] train-auc:0.999445 test-auc:0.996823
## [51] train-auc:0.999522 test-auc:0.996823
## [52] train-auc:0.999541 test-auc:0.996823
## [53] train-auc:0.999522 test-auc:0.996717
## [54] train-auc:0.999598 test-auc:0.996823
## [55] train-auc:0.999636 test-auc:0.996823
## [56] train-auc:0.999636 test-auc:0.996929
## [57] train-auc:0.999636 test-auc:0.996823
## [58] train-auc:0.999675 test-auc:0.996929
## [59] train-auc:0.999694 test-auc:0.997141
## [60] train-auc:0.999694 test-auc:0.997035
## [61] train-auc:0.999732 test-auc:0.997035
## [62] train-auc:0.999751 test-auc:0.997035
## [63] train-auc:0.999770 test-auc:0.997035
## [64] train-auc:0.999770 test-auc:0.997035
## [65] train-auc:0.999770 test-auc:0.997035
## [66] train-auc:0.999770 test-auc:0.997035
## [67] train-auc:0.999770 test-auc:0.997141
## [68] train-auc:0.999770 test-auc:0.997141
## [69] train-auc:0.999770 test-auc:0.997141
## [70] train-auc:0.999770 test-auc:0.997141
## [71] train-auc:0.999789 test-auc:0.997141
## [72] train-auc:0.999809 test-auc:0.997141
## [73] train-auc:0.999828 test-auc:0.997141
## [74] train-auc:0.999828 test-auc:0.997035
## [75] train-auc:0.999866 test-auc:0.997035
## [76] train-auc:0.999847 test-auc:0.997035
## [77] train-auc:0.999847 test-auc:0.996929
## [78] train-auc:0.999847 test-auc:0.996929
## Stopping. Best iteration:
## [79] train-auc:0.999885 test-auc:0.996929
##
## [79] train-auc:0.999885 test-auc:0.996929
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 20 rounds.
##
## [1] train-auc:0.979167 test-auc:0.960447
## [2] train-auc:0.983435 test-auc:0.985651
## [3] train-auc:0.993100 test-auc:0.993593
## [4] train-auc:0.994746 test-auc:0.992852
## [5] train-auc:0.996928 test-auc:0.992058
## [6] train-auc:0.997225 test-auc:0.992005
## [7] train-auc:0.997665 test-auc:0.994176
## [8] train-auc:0.997933 test-auc:0.995023
## [9] train-auc:0.998373 test-auc:0.994811
## [10] train-auc:0.998565 test-auc:0.994070
## [11] train-auc:0.998507 test-auc:0.994176
## [12] train-auc:0.998775 test-auc:0.994387
## [13] train-auc:0.998928 test-auc:0.995340
## [14] train-auc:0.999081 test-auc:0.995446
## [15] train-auc:0.998947 test-auc:0.995340
## [16] train-auc:0.999158 test-auc:0.995023
## [17] train-auc:0.999483 test-auc:0.994493
## [18] train-auc:0.999598 test-auc:0.994705
## [19] train-auc:0.999541 test-auc:0.995235
## [20] train-auc:0.999694 test-auc:0.995976
## [21] train-auc:0.999636 test-auc:0.995976
## [22] train-auc:0.999770 test-auc:0.996294
## [23] train-auc:0.999866 test-auc:0.996294
## [24] train-auc:0.999828 test-auc:0.996505
## [25] train-auc:0.999847 test-auc:0.996294
## [26] train-auc:0.999904 test-auc:0.996929
## [27] train-auc:0.999943 test-auc:0.996082
## [28] train-auc:0.999943 test-auc:0.996611
## [29] train-auc:0.999904 test-auc:0.996611
## [30] train-auc:0.999904 test-auc:0.995976
## [31] train-auc:0.999962 test-auc:0.995870
## [32] train-auc:0.999962 test-auc:0.995976
## [33] train-auc:0.999962 test-auc:0.995870
## [34] train-auc:0.999981 test-auc:0.995976
## [35] train-auc:0.999962 test-auc:0.995976
## [36] train-auc:0.999962 test-auc:0.996399
## [37] train-auc:0.999981 test-auc:0.996294
## [38] train-auc:0.999981 test-auc:0.996505
## [39] train-auc:0.999981 test-auc:0.996611
## [40] train-auc:0.999981 test-auc:0.996823
## [41] train-auc:0.999981 test-auc:0.996929
## [42] train-auc:0.999981 test-auc:0.996823
## [43] train-auc:0.999981 test-auc:0.996505
## [44] train-auc:0.999981 test-auc:0.996611
## [45] train-auc:0.999981 test-auc:0.996611
## Stopping. Best iteration:
## [46] train-auc:0.999981 test-auc:0.996505
##
## [46] train-auc:0.999981 test-auc:0.996505
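Each log above ends 20 rounds after the last improvement in test AUC. A minimal sketch of such a patience rule, applied to the test-AUC values copied from the eta = 0.01 log; the helper `early_stop` is hypothetical and only mimics the stopping logic, it is not xgboost's own code:

```r
# Test AUC per round, copied from the eta = 0.01 log above (rounds 1 to 31)
test_auc <- c(0.983321, 0.986392, 0.994758, 0.994758, 0.994493, 0.995076,
              0.995870, 0.996082, 0.996294, 0.996188, 0.996452, 0.996135,
              0.996399, 0.995976, 0.996082, 0.996082, 0.996188, 0.996188,
              0.995976, 0.996082, 0.995976, 0.996082, 0.996082, 0.995870,
              0.995976, 0.995976, 0.995764, 0.995764, 0.995764, 0.995870,
              0.995976)

# Hypothetical helper: stop once the metric has not improved
# for `patience` consecutive rounds (higher metric = better).
early_stop <- function(metric, patience = 20) {
  best_iter <- 1
  for (i in seq_along(metric)) {
    if (metric[i] > metric[best_iter]) best_iter <- i
    if (i - best_iter >= patience) {
      return(list(best_iter = best_iter, stopped_at = i))
    }
  }
  list(best_iter = best_iter, stopped_at = length(metric))
}

early_stop(test_auc, patience = 20)
```

With a very small eta, a patience of 20 rounds can end training while the model is still improving slowly, which is worth keeping in mind when comparing learning rates.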
| eta | Behavior |
|---|---|
| 0.01 | Needs many rounds (slow convergence, small steps) |
| 0.10 | Good AUC with a moderate number of rounds (balanced) |
| 0.30 | Needs few rounds (risk of overshooting the optimum) |
eta = 0.1 is the best choice for this dataset: it reached the highest test AUC in the logs above (0.9971 at round 59, versus 0.9965 at round 11 for eta = 0.01 and 0.9969 at round 26 for eta = 0.3), although the gaps are small. Very small values (0.01) add computational cost without a matching benefit, while very large values (0.3+) can lead to instability or suboptimal convergence.
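Why a smaller eta needs more rounds can be seen in a toy model of shrinkage. The sketch below is an analogy and not XGBoost: the "weak learner" simply predicts the mean residual, and each round adds eta times that prediction, so the remaining gap shrinks by a factor of (1 - eta) per round. The helper `rounds_to_fit` and the toy target `y` are assumptions made for the illustration.

```r
# Toy illustration of shrinkage (an analogy, not XGBoost itself):
# count the rounds needed to close 99% of the initial gap for a given eta.
rounds_to_fit <- function(eta, y = c(2, 4, 6, 8), tol = 0.01, max_rounds = 5000) {
  pred <- rep(0, length(y))
  start_gap <- abs(mean(y - pred))
  for (m in seq_len(max_rounds)) {
    pred <- pred + eta * mean(y - pred)   # shrunken update
    if (abs(mean(y - pred)) < tol * start_gap) return(m)
  }
  max_rounds
}

sapply(c(`0.01` = 0.01, `0.1` = 0.1, `0.3` = 0.3), rounds_to_fit)
```

The toy cannot overshoot for eta below 1, so it only illustrates the speed side of the trade-off, not the instability risk of large learning rates.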
(RF and XGBoost) on the same plot
## Setting direction: controls < cases
## Setting direction: controls < cases
plot(roc_rf, col = "forestgreen", lwd = 2, main = "ROC Curves Comparison")
lines(roc_xgb, col = "#8B0000", lwd = 2)
# c() is closed after the two labels so that col and lwd are arguments of legend()
legend("bottomright",
       legend = c(paste0("RF (AUC = ", round(auc_rf, 3), ")"),
                  paste0("XGB (AUC = ", round(xgb_auc, 3), ")")),
       col = c("forestgreen", "#8B0000"),
       lwd = 2)
Both curves sit very close to the top-left corner, so both models separate the two classes (benign vs malignant) extremely well. The AUC difference is tiny (0.9985 vs 0.9979, about 0.0006) and is unlikely to be statistically significant at this sample size (204 test observations); no formal test is shown here.
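AUC has a simple interpretation: the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign one. A minimal base R sketch of that definition using the rank (Mann-Whitney) formula; the helper `auc_rank` and the four-point example are made up for illustration, and this is not how pROC is implemented internally.

```r
# Hypothetical helper: AUC from ranks (Mann-Whitney U), base R only
auc_rank <- function(score, label) {
  r     <- rank(score)                 # average ranks handle ties
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Tiny made-up example: 3 of the 4 (positive, negative) pairs are ordered correctly
auc_rank(score = c(0.10, 0.40, 0.35, 0.80), label = c(0, 0, 1, 1))
```

With 71 malignant and 133 benign test cases there are 71 * 133 = 9443 such pairs, so an AUC gap of 0.0006 corresponds to only a handful of differently ordered pairs.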