============================================================

LAB 010 — Random Forests & Gradient Boosting

Dataset: Breast Cancer (Wisconsin)

Goal: Predict malignancy from cytological features

============================================================

Loading packages

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(mlbench)

## Warning: package 'mlbench' was built under R version 4.5.3

library(randomForest)

## Warning: package 'randomForest' was built under R version 4.5.3

## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin

library(xgboost)

## Warning: package 'xgboost' was built under R version 4.5.3

library(caret)

## Warning: package 'caret' was built under R version 4.5.3

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(pROC)

## Warning: package 'pROC' was built under R version 4.5.3

## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

set.seed(42)

Loading the data

data("BreastCancer", package = "mlbench")
bc <- BreastCancer

Cleaning: drop the ID, handle missing values, convert to numeric

bc$Id <- NULL      # drop the Id column
bc <- na.omit(bc)  # drop rows with missing values
bc[, 1:9] <- lapply(bc[, 1:9], function(x) as.numeric(as.character(x)))  # factor -> character -> numeric

str(bc)

## 'data.frame':    683 obs. of  10 variables:
##  $ Cl.thickness   : num  5 5 3 6 4 8 1 2 2 4 ...
##  $ Cell.size      : num  1 4 1 8 1 10 1 1 1 2 ...
##  $ Cell.shape     : num  1 4 1 8 1 10 1 2 1 1 ...
##  $ Marg.adhesion  : num  1 5 1 1 3 8 1 1 1 1 ...
##  $ Epith.c.size   : num  2 7 2 3 2 7 2 2 2 2 ...
##  $ Bare.nuclei    : num  1 10 2 4 1 10 10 1 1 1 ...
##  $ Bl.cromatin    : num  3 3 3 3 3 9 3 3 1 2 ...
##  $ Normal.nucleoli: num  1 2 1 7 1 7 1 1 1 1 ...
##  $ Mitoses        : num  1 1 1 1 1 1 1 1 5 1 ...
##  $ Class          : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
##  - attr(*, "na.action")= 'omit' Named int [1:16] 24 41 140 146 159 165 236 250 276 293 ...
##   ..- attr(*, "names")= chr [1:16] "24" "41" "140" "146" ...

table(bc$Class)

## 
##    benign malignant 
##       444       239

============================================================

PART A: BASELINE with Random Forest

============================================================

TODO 1: Make a stratified train/test split (70/30)

(this keeps the class proportions the same in both sets)

Hint: createDataPartition() from caret

idx <- createDataPartition(bc$Class, p = 0.7, list = FALSE)
train <- bc[idx, ]
test <- bc[-idx, ]

# Check that the class proportions were preserved (stratified split)
prop.table(table(train$Class)) %>% round(3)

## 
##    benign malignant 
##     0.649     0.351

prop.table(table(test$Class))  %>% round(3)

## 
##    benign malignant 
##     0.652     0.348

cat("\nTrain set:", nrow(train), "observations\n")

## 
## Train set: 479 observations

cat("Test set: ", nrow(test), "observations\n\n")

## Test set:  204 observations
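
To make the idea of a stratified split concrete, here is a minimal base R sketch that samples 70% of the rows within each class separately. The helper `stratified_split()` is hypothetical and is not caret's implementation; the toy labels only reuse the class counts of the cleaned data (444 benign, 239 malignant).

```r
# Hypothetical helper: stratified split in base R (not caret's implementation).
stratified_split <- function(y, p = 0.7) {
  # For each class, sample a fraction p of that class's row indices.
  idx_by_class <- lapply(split(seq_along(y), y), function(rows) {
    n_take <- ceiling(p * length(rows))
    rows[sample.int(length(rows), n_take)]
  })
  sort(unlist(idx_by_class, use.names = FALSE))
}

set.seed(42)
# Toy labels with the same class counts as the cleaned data.
y <- factor(rep(c("benign", "malignant"), times = c(444, 239)))
train_idx <- stratified_split(y, p = 0.7)

# Class proportions should be nearly identical in both parts.
round(prop.table(table(y[train_idx])), 3)
round(prop.table(table(y[-train_idx])), 3)
```

Because sampling happens inside each class, the two parts cannot drift apart in class balance the way a plain random split can on a small dataset.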

TODO 2: Train a Random Forest with ntree=500, importance=TRUE

Goal: a model that predicts Class

set.seed(42)
rf_model <- randomForest(
  Class ~ .,                  # all features
  data = train,
  ntree = 500,                # number of trees
  importance = TRUE           # needed for variable importance
)

print(rf_model)

## 
## Call:
##  randomForest(formula = Class ~ ., data = train, ntree = 500,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 3.76%
## Confusion matrix:
##           benign malignant class.error
## benign       301        10  0.03215434
## malignant      8       160  0.04761905

TODO 3: Compute Accuracy, Sensitivity, AUC on the test set

Hint: confusionMatrix() + roc()

# Predictions
rf_pred_class <- predict(rf_model, newdata = test, type = "response")
rf_pred_prob  <- predict(rf_model, newdata = test, type = "prob")[, "malignant"]

# Confusion Matrix
cm_rf <- confusionMatrix(rf_pred_class, test$Class, positive = "malignant")
print(cm_rf)

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       131         2
##   malignant      2        69
##                                           
##                Accuracy : 0.9804          
##                  95% CI : (0.9506, 0.9946)
##     No Information Rate : 0.652           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9568          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9718          
##             Specificity : 0.9850          
##          Pos Pred Value : 0.9718          
##          Neg Pred Value : 0.9850          
##              Prevalence : 0.3480          
##          Detection Rate : 0.3382          
##    Detection Prevalence : 0.3480          
##       Balanced Accuracy : 0.9784          
##                                           
##        'Positive' Class : malignant       
##

# AUC
roc_rf  <- roc(test$Class, rf_pred_prob, levels = c("benign", "malignant"))

## Setting direction: controls < cases

auc_rf  <- auc(roc_rf)
cat("\nAUC (Random Forest):", round(auc_rf, 4), "\n")

## 
## AUC (Random Forest): 0.9985

RF results summary

rf_results <- tibble(
  Metric = c("Accuracy", "Sensitivity", "Specificity", "AUC"),
  Value = c(
    round(cm_rf$overall["Accuracy"], 4),
    round(cm_rf$byClass["Sensitivity"], 4),
    round(cm_rf$byClass["Specificity"], 4),
    round(auc_rf, 4)
  )
)
knitr::kable(rf_results, caption = "Random Forest results on the test set")

Random Forest results on the test set
Metric	Value
Accuracy	0.9804
Sensitivity	0.9718
Specificity	0.9850
AUC	0.9985
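
As a cross-check, the three confusion-matrix metrics can be recomputed by hand from the four cell counts that confusionMatrix() printed for the Random Forest. This is only a worked arithmetic sketch; the variable names are chosen here for illustration.

```r
# Counts copied from the Random Forest test-set confusion matrix
# (positive class: malignant).
tp <- 69   # predicted malignant, actually malignant
fn <- 2    # predicted benign, actually malignant
fp <- 2    # predicted malignant, actually benign
tn <- 131  # predicted benign, actually benign

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)  # share of malignant cases that are detected
specificity <- tn / (tn + fp)  # share of benign cases correctly cleared

round(c(Accuracy = accuracy, Sensitivity = sensitivity, Specificity = specificity), 4)
```

The values agree with the table: 200/204, 69/71 and 131/133.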

TODO 4: Show the Variable Importance plot

# Plot Variable Importance
varImpPlot(
  rf_model,
  main = "Variable Importance — Random Forest",
  col = "#8F00FF",
  pch = 19
)

Answers to the questions

Question 1: Accuracy

## Accuracy of Random Forest: 98.04 %

Question 2: Top-3 Features

imp <- importance(rf_model)
imp_df <- data.frame(
  Feature = rownames(imp),
  MeanDecreaseAccuracy = imp[, "MeanDecreaseAccuracy"],
  MeanDecreaseGini = imp[, "MeanDecreaseGini"]
) %>% arrange(desc(MeanDecreaseGini))

cat("Top-3 Features:\n")

## Top-3 Features:

print(head(imp_df, 3))

##                 Feature MeanDecreaseAccuracy MeanDecreaseGini
## Cell.size     Cell.size             19.82619         60.34435
## Cell.shape   Cell.shape             21.80028         53.56115
## Bare.nuclei Bare.nuclei             26.03557         34.15482

Cell.size (Uniformity of Cell Size): the most important variable by MeanDecreaseGini
Cell.shape (Uniformity of Cell Shape): closely related to Cell.size
Bare.nuclei: the presence of bare nuclei is a strong indicator of malignancy; it has the highest MeanDecreaseAccuracy of the three

These findings are medically plausible, since cell non-uniformity and bare nuclei are classic cytological signs of malignancy.

Question 3: Is 97% Accuracy “enough” in a medical context?

There is no single answer to this question, and that is the key point.

In a medical diagnostic context, Accuracy on its own is misleading. What matters is:

Sensitivity (recall for malignant): how many of the truly malignant cases are detected. A False Negative (a cancer that was present but missed) is far more dangerous than a False Positive.
Specificity: how many healthy patients are spared unnecessary biopsies.

With Sensitivity of about 97% and AUC of about 0.99 (0.9718 and 0.9985 on this test set), the model is excellent as a decision-support tool, but it does not replace the clinical pathologist. In practice, the classification threshold would be lowered to raise Sensitivity, accepting some loss of Specificity.
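
The threshold trade-off can be illustrated with a small base R sketch. The scores below are synthetic (drawn from two beta distributions) and are not the lab's predictions; only the class counts (71 malignant, 133 benign) mirror the test set. The helper `sens_spec()` is a hypothetical name.

```r
# Illustrative data only: synthetic scores, not the lab's predictions.
set.seed(42)
truth <- rep(c(1, 0), times = c(71, 133))      # 1 = malignant, 0 = benign
score <- ifelse(truth == 1,
                rbeta(length(truth), 5, 2),    # malignant: scores skewed high
                rbeta(length(truth), 2, 5))    # benign: scores skewed low

# Sensitivity and specificity when predicting malignant for score > threshold.
sens_spec <- function(threshold) {
  pred <- as.integer(score > threshold)
  c(threshold   = threshold,
    sensitivity = sum(pred == 1 & truth == 1) / sum(truth == 1),
    specificity = sum(pred == 0 & truth == 0) / sum(truth == 0))
}

thresholds <- c(0.5, 0.4, 0.3, 0.2)
res <- t(sapply(thresholds, sens_spec))
print(round(res, 3))
```

Lowering the threshold can only keep or increase sensitivity and can only keep or decrease specificity, so the choice of cut-off is a clinical cost decision rather than a purely statistical one.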

============================================================

PART B: BOOSTING & TUNING

============================================================

TODO 5: Prepare the data for XGBoost

Hint: as.matrix() + ifelse() for the target

# Convert to a numeric matrix
train_x <- as.matrix(train[, 1:9])
train_y <- ifelse(train$Class == "malignant", 1, 0)

test_x <- as.matrix(test[, 1:9])
test_y <- ifelse(test$Class == "malignant", 1, 0)

# Create xgb.DMatrix objects (XGBoost's optimized data format)
dtrain <- xgb.DMatrix(data = train_x, label = train_y)
dtest  <- xgb.DMatrix(data = test_x,  label = test_y)

TODO 6: Train an XGBoost model with early stopping

Parameters: max_depth=4, eta=0.1, nrounds=500, early_stopping_rounds=20

set.seed(42)
params <- list(
  objective   = "binary:logistic",
  eval_metric = "auc",
  max_depth   = 4,
  eta         = 0.1,        # learning rate
  subsample   = 0.8,
  colsample_bytree = 0.8
)

xgb_model <- xgb.train(
  params              = params,
  data                = dtrain,
  nrounds             = 500,
  evals               = list(train = dtrain, test = dtest),  # 'watchlist' was renamed to 'evals' in this xgboost version
  early_stopping_rounds = 20,
  print_every_n       = 25,
  verbose             = 1
)

## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 20 rounds.
## 
## [1]  train-auc:0.977109  test-auc:0.967489 
## [26] train-auc:0.997722  test-auc:0.995446 
## [51] train-auc:0.999541  test-auc:0.997458 
## [76] train-auc:0.999847  test-auc:0.997564 
## Stopping. Best iteration:
## [95] train-auc:0.999923  test-auc:0.997458
## 
## [95] train-auc:0.999923  test-auc:0.997458

# XGBoost predictions
xgb_prob <- predict(xgb_model, dtest)
xgb_pred <- factor(ifelse(xgb_prob > 0.5, "malignant", "benign"),
                   levels = c("benign", "malignant"))

xgb_cm <- confusionMatrix(xgb_pred, test$Class, positive = "malignant")
print(xgb_cm)

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       131         3
##   malignant      2        68
##                                          
##                Accuracy : 0.9755         
##                  95% CI : (0.9437, 0.992)
##     No Information Rate : 0.652          
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9458         
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.9577         
##             Specificity : 0.9850         
##          Pos Pred Value : 0.9714         
##          Neg Pred Value : 0.9776         
##              Prevalence : 0.3480         
##          Detection Rate : 0.3333         
##    Detection Prevalence : 0.3431         
##       Balanced Accuracy : 0.9714         
##                                          
##        'Positive' Class : malignant      
##

xgb_auc <- roc(test$Class, xgb_prob, levels = c("benign", "malignant"))$auc

## Setting direction: controls < cases

cat("XGBoost AUC:", round(xgb_auc, 3), "\n")

## XGBoost AUC: 0.998

TODO 7: Compare XGBoost vs Random Forest in a table

Columns: Accuracy, Sensitivity, Specificity, AUC

comparison <- tibble(
  Model       = c("Random Forest", "XGBoost"),
  Accuracy    = c(
    round(cm_rf$overall["Accuracy"],   4),
    round(xgb_cm$overall["Accuracy"],  4)
  ),
  Sensitivity = c(
    round(cm_rf$byClass["Sensitivity"],  4),
    round(xgb_cm$byClass["Sensitivity"], 4)
  ),
  Specificity = c(
    round(cm_rf$byClass["Specificity"],  4),
    round(xgb_cm$byClass["Specificity"], 4)
  ),
  AUC = c(round(auc_rf, 4), round(xgb_auc, 4))
)

knitr::kable(comparison, caption = "Random Forest vs XGBoost")

Random Forest vs XGBoost
Model	Accuracy	Sensitivity	Specificity	AUC
Random Forest	0.9804	0.9718	0.985	0.9985
XGBoost	0.9755	0.9577	0.985	0.9979

imp_xgb <- xgb.importance(model = xgb_model)
print(imp_xgb)

##            Feature       Gain      Cover  Frequency
##             <char>      <num>      <num>      <num>
## 1:      Cell.shape 0.41314930 0.17212968 0.10379242
## 2:       Cell.size 0.26421488 0.14343242 0.09580838
## 3:     Bare.nuclei 0.12418836 0.20984672 0.17964072
## 4:    Cl.thickness 0.04996820 0.10978109 0.13772455
## 5: Normal.nucleoli 0.04993893 0.11101276 0.13972056
## 6:     Bl.cromatin 0.04290336 0.10546435 0.08782435
## 7:   Marg.adhesion 0.03863542 0.09280645 0.16566866
## 8:    Epith.c.size 0.01144765 0.03618452 0.06986028
## 9:         Mitoses 0.00555390 0.01934201 0.01996008

xgb.plot.importance(imp_xgb, top_n = 8, 
                    main = "Feature Importance — XGBoost")

# Timing
time_rf  <- system.time(randomForest(Class ~ ., data = train, ntree = 500))
time_xgb <- system.time(xgb.train(params, dtrain, nrounds = 100))

data.frame(
  Model   = c("Random Forest", "XGBoost"),
  Seconds = c(time_rf[3], time_xgb[3])
)

##           Model Seconds
## 1 Random Forest    0.08
## 2       XGBoost    0.28

Conclusion

Both models agree on the same top-3 features (Cell.size, Cell.shape, Bare.nuclei), despite their different logic (bagging vs boosting). This works as a sanity check: when two different algorithms point to the same features, confidence grows that those features are genuinely informative rather than a random artifact. The small difference in rank order (Cell.size first for RF, Cell.shape first for XGBoost) is expected, since the RF ranking here uses MeanDecreaseGini and XGBoost uses Gain.
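
The agreement between the two rankings can also be checked programmatically. The sketch below copies the importance values printed earlier in this report (only the top three are available for the Random Forest); `top_k()` is a hypothetical helper.

```r
# Values copied from the importance tables printed in this report.
rf_gini <- c(Cell.size = 60.34435, Cell.shape = 53.56115, Bare.nuclei = 34.15482)
xgb_gain <- c(Cell.shape = 0.41314930, Cell.size = 0.26421488,
              Bare.nuclei = 0.12418836, Cl.thickness = 0.04996820,
              Normal.nucleoli = 0.04993893, Bl.cromatin = 0.04290336,
              Marg.adhesion = 0.03863542, Epith.c.size = 0.01144765,
              Mitoses = 0.00555390)

# Hypothetical helper: names of the k largest importance values.
top_k <- function(importance, k = 3) {
  names(sort(importance, decreasing = TRUE))[seq_len(k)]
}

rf_top3  <- top_k(rf_gini)
xgb_top3 <- top_k(xgb_gain)

setequal(rf_top3, xgb_top3)   # TRUE: same set of features
identical(rf_top3, xgb_top3)  # FALSE: the order differs
```

Comparing sets rather than exact order avoids over-reading rank swaps between two importance measures that are not on the same scale.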

TODO 8 (BONUS): Try TWO different eta values (0.01 and 0.3)

run_xgb <- function(eta_val) {
  params_eta <- list(
    objective = "binary:logistic",
    eval_metric = "auc",
    max_depth = 4,
    eta = eta_val,
    subsample = 0.8,
    colsample_bytree = 0.8
  )
  m <- xgb.train(
    params = params_eta,
    data = dtrain,
    nrounds = 500,
    evals = list(train = dtrain, test = dtest),
    early_stopping_rounds = 20,
    verbose = 1
  )
  list(
    eta = eta_val,
    best_iter = m$best_iteration,
    best_auc = round(as.numeric(m$best_score), 4),
    model = m
  )
}

res_001 <- run_xgb(0.01)

## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 20 rounds.
## 
## [1]  train-auc:0.983444  test-auc:0.983321 
## [2]  train-auc:0.985894  test-auc:0.986392 
## [3]  train-auc:0.990181  test-auc:0.994758 
## [4]  train-auc:0.990067  test-auc:0.994758 
## [5]  train-auc:0.990650  test-auc:0.994493 
## [6]  train-auc:0.990526  test-auc:0.995076 
## [7]  train-auc:0.991473  test-auc:0.995870 
## [8]  train-auc:0.992086  test-auc:0.996082 
## [9]  train-auc:0.991990  test-auc:0.996294 
## [10] train-auc:0.991894  test-auc:0.996188 
## [11] train-auc:0.992172  test-auc:0.996452 
## [12] train-auc:0.992038  test-auc:0.996135 
## [13] train-auc:0.992153  test-auc:0.996399 
## [14] train-auc:0.992076  test-auc:0.995976 
## [15] train-auc:0.992153  test-auc:0.996082 
## [16] train-auc:0.992095  test-auc:0.996082 
## [17] train-auc:0.992172  test-auc:0.996188 
## [18] train-auc:0.992229  test-auc:0.996188 
## [19] train-auc:0.992296  test-auc:0.995976 
## [20] train-auc:0.992277  test-auc:0.996082 
## [21] train-auc:0.992976  test-auc:0.995976 
## [22] train-auc:0.992976  test-auc:0.996082 
## [23] train-auc:0.992909  test-auc:0.996082 
## [24] train-auc:0.992804  test-auc:0.995870 
## [25] train-auc:0.992842  test-auc:0.995976 
## [26] train-auc:0.992861  test-auc:0.995976 
## [27] train-auc:0.993493  test-auc:0.995764 
## [28] train-auc:0.993531  test-auc:0.995764 
## [29] train-auc:0.993646  test-auc:0.995764 
## [30] train-auc:0.993550  test-auc:0.995870 
## Stopping. Best iteration:
## [31] train-auc:0.993627  test-auc:0.995976
## 
## [31] train-auc:0.993627  test-auc:0.995976

res_01 <- run_xgb(0.1)

## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 20 rounds.
## 
## [1]  train-auc:0.975568  test-auc:0.973472 
## [2]  train-auc:0.982602  test-auc:0.980779 
## [3]  train-auc:0.989062  test-auc:0.992005 
## [4]  train-auc:0.993052  test-auc:0.992322 
## [5]  train-auc:0.995493  test-auc:0.994070 
## [6]  train-auc:0.996086  test-auc:0.993752 
## [7]  train-auc:0.996076  test-auc:0.993540 
## [8]  train-auc:0.995962  test-auc:0.993434 
## [9]  train-auc:0.995875  test-auc:0.993646 
## [10] train-auc:0.995761  test-auc:0.993434 
## [11] train-auc:0.996067  test-auc:0.992905 
## [12] train-auc:0.996067  test-auc:0.993805 
## [13] train-auc:0.996153  test-auc:0.994123 
## [14] train-auc:0.996459  test-auc:0.994123 
## [15] train-auc:0.996373  test-auc:0.994864 
## [16] train-auc:0.996622  test-auc:0.994758 
## [17] train-auc:0.996718  test-auc:0.995446 
## [18] train-auc:0.996765  test-auc:0.995764 
## [19] train-auc:0.996919  test-auc:0.995764 
## [20] train-auc:0.997206  test-auc:0.995658 
## [21] train-auc:0.997167  test-auc:0.995976 
## [22] train-auc:0.997512  test-auc:0.996294 
## [23] train-auc:0.997550  test-auc:0.995870 
## [24] train-auc:0.997703  test-auc:0.995976 
## [25] train-auc:0.997799  test-auc:0.996294 
## [26] train-auc:0.997990  test-auc:0.996082 
## [27] train-auc:0.998124  test-auc:0.995976 
## [28] train-auc:0.998124  test-auc:0.995976 
## [29] train-auc:0.998297  test-auc:0.995976 
## [30] train-auc:0.998526  test-auc:0.995764 
## [31] train-auc:0.998603  test-auc:0.995976 
## [32] train-auc:0.998603  test-auc:0.996188 
## [33] train-auc:0.998832  test-auc:0.995976 
## [34] train-auc:0.998852  test-auc:0.996294 
## [35] train-auc:0.998832  test-auc:0.995870 
## [36] train-auc:0.998890  test-auc:0.995870 
## [37] train-auc:0.998871  test-auc:0.995976 
## [38] train-auc:0.998909  test-auc:0.996082 
## [39] train-auc:0.999081  test-auc:0.996188 
## [40] train-auc:0.999215  test-auc:0.996188 
## [41] train-auc:0.999273  test-auc:0.996399 
## [42] train-auc:0.999273  test-auc:0.996611 
## [43] train-auc:0.999196  test-auc:0.996294 
## [44] train-auc:0.999254  test-auc:0.996399 
## [45] train-auc:0.999311  test-auc:0.996399 
## [46] train-auc:0.999388  test-auc:0.996399 
## [47] train-auc:0.999407  test-auc:0.996188 
## [48] train-auc:0.999464  test-auc:0.996505 
## [49] train-auc:0.999464  test-auc:0.996717 
## [50] train-auc:0.999445  test-auc:0.996823 
## [51] train-auc:0.999522  test-auc:0.996823 
## [52] train-auc:0.999541  test-auc:0.996823 
## [53] train-auc:0.999522  test-auc:0.996717 
## [54] train-auc:0.999598  test-auc:0.996823 
## [55] train-auc:0.999636  test-auc:0.996823 
## [56] train-auc:0.999636  test-auc:0.996929 
## [57] train-auc:0.999636  test-auc:0.996823 
## [58] train-auc:0.999675  test-auc:0.996929 
## [59] train-auc:0.999694  test-auc:0.997141 
## [60] train-auc:0.999694  test-auc:0.997035 
## [61] train-auc:0.999732  test-auc:0.997035 
## [62] train-auc:0.999751  test-auc:0.997035 
## [63] train-auc:0.999770  test-auc:0.997035 
## [64] train-auc:0.999770  test-auc:0.997035 
## [65] train-auc:0.999770  test-auc:0.997035 
## [66] train-auc:0.999770  test-auc:0.997035 
## [67] train-auc:0.999770  test-auc:0.997141 
## [68] train-auc:0.999770  test-auc:0.997141 
## [69] train-auc:0.999770  test-auc:0.997141 
## [70] train-auc:0.999770  test-auc:0.997141 
## [71] train-auc:0.999789  test-auc:0.997141 
## [72] train-auc:0.999809  test-auc:0.997141 
## [73] train-auc:0.999828  test-auc:0.997141 
## [74] train-auc:0.999828  test-auc:0.997035 
## [75] train-auc:0.999866  test-auc:0.997035 
## [76] train-auc:0.999847  test-auc:0.997035 
## [77] train-auc:0.999847  test-auc:0.996929 
## [78] train-auc:0.999847  test-auc:0.996929 
## Stopping. Best iteration:
## [79] train-auc:0.999885  test-auc:0.996929
## 
## [79] train-auc:0.999885  test-auc:0.996929

res_03 <- run_xgb(0.3)

## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 20 rounds.
## 
## [1]  train-auc:0.979167  test-auc:0.960447 
## [2]  train-auc:0.983435  test-auc:0.985651 
## [3]  train-auc:0.993100  test-auc:0.993593 
## [4]  train-auc:0.994746  test-auc:0.992852 
## [5]  train-auc:0.996928  test-auc:0.992058 
## [6]  train-auc:0.997225  test-auc:0.992005 
## [7]  train-auc:0.997665  test-auc:0.994176 
## [8]  train-auc:0.997933  test-auc:0.995023 
## [9]  train-auc:0.998373  test-auc:0.994811 
## [10] train-auc:0.998565  test-auc:0.994070 
## [11] train-auc:0.998507  test-auc:0.994176 
## [12] train-auc:0.998775  test-auc:0.994387 
## [13] train-auc:0.998928  test-auc:0.995340 
## [14] train-auc:0.999081  test-auc:0.995446 
## [15] train-auc:0.998947  test-auc:0.995340 
## [16] train-auc:0.999158  test-auc:0.995023 
## [17] train-auc:0.999483  test-auc:0.994493 
## [18] train-auc:0.999598  test-auc:0.994705 
## [19] train-auc:0.999541  test-auc:0.995235 
## [20] train-auc:0.999694  test-auc:0.995976 
## [21] train-auc:0.999636  test-auc:0.995976 
## [22] train-auc:0.999770  test-auc:0.996294 
## [23] train-auc:0.999866  test-auc:0.996294 
## [24] train-auc:0.999828  test-auc:0.996505 
## [25] train-auc:0.999847  test-auc:0.996294 
## [26] train-auc:0.999904  test-auc:0.996929 
## [27] train-auc:0.999943  test-auc:0.996082 
## [28] train-auc:0.999943  test-auc:0.996611 
## [29] train-auc:0.999904  test-auc:0.996611 
## [30] train-auc:0.999904  test-auc:0.995976 
## [31] train-auc:0.999962  test-auc:0.995870 
## [32] train-auc:0.999962  test-auc:0.995976 
## [33] train-auc:0.999962  test-auc:0.995870 
## [34] train-auc:0.999981  test-auc:0.995976 
## [35] train-auc:0.999962  test-auc:0.995976 
## [36] train-auc:0.999962  test-auc:0.996399 
## [37] train-auc:0.999981  test-auc:0.996294 
## [38] train-auc:0.999981  test-auc:0.996505 
## [39] train-auc:0.999981  test-auc:0.996611 
## [40] train-auc:0.999981  test-auc:0.996823 
## [41] train-auc:0.999981  test-auc:0.996929 
## [42] train-auc:0.999981  test-auc:0.996823 
## [43] train-auc:0.999981  test-auc:0.996505 
## [44] train-auc:0.999981  test-auc:0.996611 
## [45] train-auc:0.999981  test-auc:0.996611 
## Stopping. Best iteration:
## [46] train-auc:0.999981  test-auc:0.996505
## 
## [46] train-auc:0.999981  test-auc:0.996505

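What do you observe about the number of rounds needed?

eta	Behavior
0.01	Stopped after 31 rounds (best test AUC 0.9965 at round 11). Each step is small, so convergence is slow; here the 20-round patience ended training before the model could catch up, and many more rounds would normally be needed.
0.10	Stopped after 79 rounds (best test AUC 0.9971 at round 59). Good AUC with a moderate number of rounds (balanced).
0.30	Stopped after 46 rounds (best test AUC 0.9969 at round 26). Few rounds needed, with a risk of overshooting the optimum.

Conclusion

eta=0.1 gave the best test AUC on this dataset, although the three values differ only in the third decimal. Very small values (0.01) add computational cost without a matching benefit, while very large ones (0.3 and above) can lead to instability or suboptimal convergence.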
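
To see why the eta = 0.01 run ended so early, the patience rule can be replayed on the logged test AUC values. The function `early_stop()` below is a hypothetical re-implementation that reproduces the behaviour visible in the logs (ties do not count as improvement); XGBoost's internal logic may differ in detail.

```r
# Test AUC per round, copied from the eta = 0.01 log in this report (rounds 1 to 31).
test_auc <- c(0.983321, 0.986392, 0.994758, 0.994758, 0.994493, 0.995076,
              0.995870, 0.996082, 0.996294, 0.996188, 0.996452, 0.996135,
              0.996399, 0.995976, 0.996082, 0.996082, 0.996188, 0.996188,
              0.995976, 0.996082, 0.995976, 0.996082, 0.996082, 0.995870,
              0.995976, 0.995976, 0.995764, 0.995764, 0.995764, 0.995870,
              0.995976)

# Hypothetical helper: patience-based early stopping on a "higher is better" metric.
early_stop <- function(scores, patience) {
  best_iter <- 1
  for (i in seq_along(scores)) {
    # Only a strict improvement resets the patience counter.
    if (scores[i] > scores[best_iter]) best_iter <- i
    if (i - best_iter >= patience) {
      return(list(best_iter = best_iter, stop_iter = i))
    }
  }
  list(best_iter = best_iter, stop_iter = length(scores))  # rule never triggered
}

res <- early_stop(test_auc, patience = 20)
res$best_iter  # 11
res$stop_iter  # 31
```

With a small learning rate the per-round gains are of the same size as the noise in the evaluation metric, so a fixed patience of 20 can trigger long before the model has converged; a larger patience is the usual companion of a smaller eta.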

TODO 9 (BONUS): Build a ROC plot with 2 curves

(RF and XGBoost) on the same plot

roc_rf  <- roc(test$Class, rf_pred_prob,  levels = c("benign", "malignant"))

## Setting direction: controls < cases

roc_xgb <- roc(test$Class, xgb_prob, levels = c("benign", "malignant"))

## Setting direction: controls < cases

plot(roc_rf,  col = "forestgreen", lwd = 2, main = "ROC Curves Comparison")
lines(roc_xgb, col = "#8B0000",   lwd = 2)
legend("bottomright",
       legend = c(paste0("RF (AUC = ",  round(auc_rf,  3), ")"),
                  paste0("XGB (AUC = ", round(xgb_auc, 3), ")")),  # close c() here so col/lwd are legend() arguments
       col = c("forestgreen", "#8B0000"),
       lwd = 2)

Observation

Both curves lie very close to the top-left corner, so both models separate the two classes (benign vs malignant) extremely well. The slight difference in AUC (0.9985 vs 0.9979, about 0.0006) was not formally tested, but it is very unlikely to be statistically significant with only 204 test observations.
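
For intuition about what the AUC values mean, the AUC equals the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign one. A minimal base R sketch of that rank-based (Mann-Whitney) formulation follows; `auc_rank()` is a hypothetical helper and the six scores are made up for illustration, not taken from the models above.

```r
# AUC via the Mann-Whitney rank statistic (ties receive average ranks).
auc_rank <- function(score, is_positive) {
  r  <- rank(score)
  n1 <- sum(is_positive)    # number of positive (malignant) cases
  n0 <- sum(!is_positive)   # number of negative (benign) cases
  (sum(r[is_positive]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Tiny illustrative example: 8 of the 9 positive/negative pairs are ordered correctly.
score       <- c(0.9, 0.8, 0.35, 0.6, 0.3, 0.1)
is_positive <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
auc_value   <- auc_rank(score, is_positive)
auc_value   # 8/9, about 0.889
```

Read this way, an AUC of 0.998 means that almost every malignant/benign pair in the test set is ranked in the right order, which leaves very little room for one model to beat the other.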

BA-task009

2026-05-10
