Introduction

This report details the steps and results of a machine learning pipeline developed for cancer data analysis. It includes data preprocessing, various feature selection techniques (Near-Zero Variance and Boruta), hyperparameter tuning of a Random Forest model, class imbalance handling using SMOTE, and evaluation using 10-fold cross-validation and a hold-out test set. The impact of key steps like different feature selection methods and SMOTE is also visualized.

Configuration and Data Loading

Global Configuration

The following global configurations were used for the analysis:

file_path <- "Norm_avg_plasma_data - Sheet2.csv" # IMPORTANT: Ensure this file is in your R working directory
positive_class_name <- "Cancer"
negative_class_name <- "Normal"

Data Loading and Initial Inspection

The dataset is loaded from Norm_avg_plasma_data - Sheet2.csv. It’s assumed to have samples as rows, a header row, with the first column for sample identifiers, the second for class labels, and subsequent columns for feature values.
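The X-prefixed column names in the header listing are consistent with R's default name sanitizing in read.csv (check.names = TRUE), which passes each header through make.names. The following is a minimal sketch of that behavior, not the report's loading code; the inline CSV, the header spellings, and the object names raw_demo and raw_demo_verbatim are made up for illustration.

```r
# Hypothetical excerpt; header spellings and values are illustrative only.
csv_text <- "Unique Id,Sample type,0,0.1,0.117
PC1,Cancer,-0.00363,-0.00201,-0.00152
N1,Normal,0.00011,0.00006,0.00002"

# check.names = TRUE (the default) turns headers into syntactically valid R names
raw_demo <- read.csv(text = csv_text, header = TRUE, check.names = TRUE)
print(names(raw_demo))

# check.names = FALSE keeps the headers exactly as written in the file
raw_demo_verbatim <- read.csv(text = csv_text, header = TRUE, check.names = FALSE)
print(names(raw_demo_verbatim))
```

Keeping the verbatim headers can be useful when the numeric part of a column name (for example a time point or wavelength) needs to be parsed back into a number later.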

## Dimensions of raw data:  75  rows,  2813  columns.
## Column names from CSV header:
Raw Data Column Names (Transposed for Readability)
Unique.Id Sample.type X0 X0.1 X0.117 X0.133333 X0.15 X0.166667 X0.183 X0.2 X0.216667 X0.233333 X0.25 X0.266667 X0.283333 X0.3 X0.317 X0.333333 X0.35 X0.366667 X0.383 X0.4 X0.416667 X0.433333 X0.45 X0.466667 X0.483333 X0.5 X0.517 X0.533333 X0.55 X0.566667 X0.583 X0.6 X0.616667 X0.633333 X0.65 X0.666667 X0.683333 X0.7 X0.717 X0.733333 X0.75 X0.766667 X0.783 X0.8 X0.816667 X0.833333 X0.85 X0.867 X0.883333 X0.9 X0.916667 X0.933333 X0.95 X0.966667 X0.983 X1 X1.016667 X1.033333 X1.05 X1.066667 X1.083333 X1.1 X1.116667 X1.133333 X1.15 X1.166667 X1.183333 X1.2 X1.216667 X1.233333 X1.25 X1.266667 X1.283333 X1.3 X1.316667 X1.333333 X1.35 X1.366667 X1.383333 X1.4 X1.416667 X1.433333 X1.45 X1.466667 X1.483333 X1.5 X1.516667 X1.533333 X1.55 X1.566667 X1.583333 X1.6 X1.616667 X1.633333 X1.65 X1.666667 X1.683333 X1.7 X1.716667 X1.733333 X1.75 X1.766667 X1.783333 X1.8 X1.816667 X1.833333 X1.85 X1.866667 X1.883333 X1.9 X1.916667 X1.933333 X1.95 X1.966667 X1.983333 X2 X2.016667 X2.033333 X2.05 X2.066667 X2.083333 X2.1 X2.116667 X2.133333 X2.15 X2.166667 X2.183333 X2.2 X2.216667 X2.233333 X2.25 X2.266667 X2.283333 X2.3 X2.316667 X2.333333 X2.35 X2.366667 X2.383333 X2.4 X2.416667 X2.433333 X2.45 X2.466667 X2.483333 X2.5 X2.516667 X2.533333 X2.55 X2.566667 X2.583333 X2.6 X2.616667 X2.633333 X2.65 X2.666667 X2.683333 X2.7 X2.716667 X2.733333 X2.75 X2.766667 X2.783333 X2.8 X2.816667 X2.833333 X2.85 X2.866667 X2.883333 X2.9 X2.916667 X2.933333 X2.95 X2.966667 X2.983333 X3 X3.016667 X3.033333 X3.05 X3.066667 X3.083333 X3.1 X3.116667 X3.133333 X3.15 X3.166667 X3.183333 X3.2 X3.216667 X3.233333 X3.25 X3.266667 X3.283333 X3.3 X3.316667 X3.333333 X3.35 X3.366667 X3.383333 X3.4 X3.416667 X3.433333 X3.45 X3.466667 X3.483333 X3.5 X3.516667 X3.533333 X3.55 X3.566667 X3.583333 X3.6 X3.616667 X3.633333 X3.65 X3.666667 X3.683333 X3.7 X3.716667 X3.733333 X3.75 X3.766667 X3.783333 X3.8 X3.816667 X3.833333 X3.85 X3.866667 X3.883333 X3.9 X3.916667 X3.933333 X3.95 X3.966667 X3.983333 X4 X4.016667 
X4.033333 X4.05 X4.066667 X4.083333 X4.1 X4.116667 X4.133333 X4.15 X4.166667 X4.183333 X4.2 X4.216667 X4.233333 X4.25 X4.266667 X4.283333 X4.3 X4.316667 X4.333333 X4.35 X4.366667 X4.383333 X4.4 X4.416667 X4.433333 X4.45 X4.466667 X4.483333 X4.5 X4.516667 X4.533333 X4.55 X4.566667 X4.583333 X4.6 X4.616667 X4.633333 X4.65 X4.666667 X4.683333 X4.7 X4.716667 X4.733333 X4.75 X4.766667 X4.783333 X4.8 X4.816667 X4.833333 X4.85 X4.866667 X4.883333 X4.9 X4.916667 X4.933333 X4.95 X4.966667 X4.983333 X5 X5.016667 X5.033333 X5.05 X5.066667 X5.083333 X5.1 X5.116667 X5.133333 X5.15 X5.166667 X5.183333 X5.2 X5.216667 X5.233333 X5.25 X5.266667 X5.283333 X5.3 X5.316667 X5.333333 X5.35 X5.366667 X5.383333 X5.4 X5.416667 X5.433333 X5.45 X5.466667 X5.483333 X5.5 X5.516667 X5.533333 X5.55 X5.566667 X5.583333 X5.6 X5.616667 X5.633333 X5.65 X5.666667 X5.683333 X5.7 X5.716667 X5.733333 X5.75 X5.766667 X5.783333 X5.8 X5.816667 X5.833333 X5.85 X5.866667 X5.883333 X5.9 X5.916667 X5.933333 X5.95 X5.966667 X5.983333 X6 X6.016667 X6.033333 X6.05 X6.066667 X6.083333 X6.1 X6.116667 X6.133333 X6.15 X6.166667 X6.183333 X6.2 X6.216667 X6.233333 X6.25 X6.266667 X6.283333 X6.3 X6.316667 X6.333333 X6.35 X6.366667 X6.383333 X6.4 X6.416667 X6.433333 X6.45 X6.466667 X6.483333 X6.5 X6.516667 X6.533333 X6.55 X6.566667 X6.583333 X6.6 X6.616667 X6.633333 X6.65 X6.666667 X6.683333 X6.7 X6.716667 X6.733333 X6.75 X6.766667 X6.783333 X6.8 X6.816667 X6.833333 X6.85 X6.866667 X6.883333 X6.9 X6.916667 X6.933333 X6.95 X6.966667 X6.983333 X7 X7.016667 X7.033333 X7.05 X7.066667 X7.083333 X7.1 X7.116667 X7.133333 X7.15 X7.166667 X7.183333 X7.2 X7.216667 X7.233333 X7.25 X7.266667 X7.283333 X7.3 X7.316667 X7.333333 X7.35 X7.366667 X7.383333 X7.4 X7.416667 X7.433333 X7.45 X7.466667 X7.483333 X7.5 X7.516667 X7.533333 X7.55 X7.566667 X7.583333 X7.6 X7.616667 X7.633333 X7.65 X7.666667 X7.683333 X7.7 X7.716667 X7.733333 X7.75 X7.766667 X7.783333 X7.8 X7.816667 X7.833333 X7.85 X7.866667 X7.883333 X7.9 X7.916667 X7.933333 X7.95 
X7.966667 X7.983333 X8 X8.016667 X8.033333 X8.05 X8.066667 X8.083333 X8.1 X8.116667 X8.133333 X8.15 X8.166667 X8.183333 X8.2 X8.216667 X8.233333 X8.25 X8.266667 X8.283333 X8.3 X8.316667 X8.333333 X8.35 X8.366667 X8.383333 X8.4 X8.416667 X8.433333 X8.45 X8.466667 X8.483333 X8.5 X8.516667 X8.533333 X8.55 X8.566667 X8.583333 X8.6 X8.616667 X8.633333 X8.65 X8.666667 X8.683333 X8.7 X8.716667 X8.733333 X8.75 X8.766667 X8.783333 X8.8 X8.816667 X8.833333 X8.85 X8.866667 X8.883333 X8.9 X8.916667 X8.933333 X8.95 X8.966667 X8.983333 X9 X9.016667 X9.033333 X9.05 X9.066667 X9.083333 X9.1 X9.116667 X9.133333 X9.15 X9.166667 X9.183333 X9.2 X9.216667 X9.233333 X9.25 X9.266667 X9.283333 X9.3 X9.316667 X9.333333 X9.35 X9.366667 X9.383333 X9.4 X9.416667 X9.433333 X9.45 X9.466667 X9.483333 X9.5 X9.516667 X9.533333 X9.55 X9.566667 X9.583333 X9.6 X9.616667 X9.633333 X9.65 X9.666667 X9.683333 X9.7 X9.716667 X9.733333 X9.75 X9.766667 X9.783333 X9.8 X9.816667 X9.833333 X9.85 X9.866667 X9.883333 X9.9 X9.916667 X9.933333 X9.95 X9.966667 X9.983333 X10 X10.01667 X10.03333 X10.05 X10.06667 X10.08333 X10.1 X10.11667 X10.13333 X10.15 X10.16667 X10.18333 X10.2 X10.21667 X10.23333 X10.25 X10.26667 X10.28333 X10.3 X10.31667 X10.33333 X10.35 X10.36667 X10.38333 X10.4 X10.41667 X10.43333 X10.45 X10.46667 X10.48333 X10.5 X10.51667 X10.53333 X10.55 X10.56667 X10.58333 X10.6 X10.61667 X10.63333 X10.65 X10.66667 X10.68333 X10.7 X10.71667 X10.73333 X10.75 X10.76667 X10.78333 X10.8 X10.81667 X10.83333 X10.85 X10.86667 X10.88333 X10.9 X10.91667 X10.93333 X10.95 X10.96667 X10.98333 X11 X11.01667 X11.03333 X11.05 X11.06667 X11.08333 X11.1 X11.11667 X11.13333 X11.15 X11.16667 X11.18333 X11.2 X11.21667 X11.23333 X11.25 X11.26667 X11.28333 X11.3 X11.31667 X11.33333 X11.35 X11.36667 X11.38333 X11.4 X11.41667 X11.43333 X11.45 X11.46667 X11.48333 X11.5 X11.51667 X11.53333 X11.55 X11.56667 X11.58333 X11.6 X11.61667 X11.63333 X11.65 X11.66667 X11.68333 X11.7 X11.71667 X11.73333 X11.75 X11.76667 X11.78333 X11.8 
X11.81667 X11.83333 X11.85 X11.86667 X11.88333 X11.9 X11.91667 X11.93333 X11.95 X11.96667 X11.98333 X12 X12.01667 X12.03333 X12.05 X12.06667 X12.08333 X12.1 X12.11667 X12.13333 X12.15 X12.16667 X12.18333 X12.2 X12.21667 X12.23333 X12.25 X12.26667 X12.28333 X12.3 X12.31667 X12.33333 X12.35 X12.36667 X12.38333 X12.4 X12.41667 X12.43333 X12.45 X12.46667 X12.48333 X12.5 X12.51667 X12.53333 X12.55 X12.56667 X12.58333 X12.6 X12.61667 X12.63333 X12.65 X12.66667 X12.68333 X12.7 X12.71667 X12.73333 X12.75 X12.76667 X12.78333 X12.8 X12.81667 X12.83333 X12.85 X12.86667 X12.88333 X12.9 X12.91667 X12.93333 X12.95 X12.96667 X12.98333 X13 X13.01667 X13.03333 X13.05 X13.06667 X13.08333 X13.1 X13.11667 X13.13333 X13.15 X13.16667 X13.18333 X13.2 X13.21667 X13.23333 X13.25 X13.26667 X13.28333 X13.3 X13.31667 X13.33333 X13.35 X13.36667 X13.38333 X13.4 X13.41667 X13.43333 X13.45 X13.46667 X13.48333 X13.5 X13.51667 X13.53333 X13.55 X13.56667 X13.58333 X13.6 X13.61667 X13.63333 X13.65 X13.66667 X13.68333 X13.7 X13.71667 X13.73333 X13.75 X13.76667 X13.78333 X13.8 X13.81667 X13.83333 X13.85 X13.86667 X13.88333 X13.9 X13.91667 X13.93333 X13.95 X13.96667 X13.98333 X14 X14.01667 X14.03333 X14.05 X14.06667 X14.08333 X14.1 X14.11667 X14.13333 X14.15 X14.16667 X14.18333 X14.2 X14.21667 X14.23333 X14.25 X14.26667 X14.28333 X14.3 X14.31667 X14.33333 X14.35 X14.36667 X14.38333 X14.4 X14.41667 X14.43333 X14.45 X14.46667 X14.48333 X14.5 X14.51667 X14.53333 X14.55 X14.56667 X14.58333 X14.6 X14.61667 X14.63333 X14.65 X14.66667 X14.68333 X14.7 X14.71667 X14.73333 X14.75 X14.76667 X14.78333 X14.8 X14.81667 X14.83333 X14.85 X14.86667 X14.88333 X14.9 X14.91667 X14.93333 X14.95 X14.96667 X14.98333 X15 X15.01667 X15.03333 X15.05 X15.06667 X15.08333 X15.1 X15.11667 X15.13333 X15.15 X15.16667 X15.18333 X15.2 X15.21667 X15.23333 X15.25 X15.26667 X15.28333 X15.3 X15.31667 X15.33333 X15.35 X15.36667 X15.38333 X15.4 X15.41667 X15.43333 X15.45 X15.46667 X15.48333 X15.5 X15.51667 X15.53333 X15.55 X15.56667 X15.58333 
X15.6 X15.61667 X15.63333 X15.65 X15.66667 X15.68333 X15.7 X15.71667 X15.73333 X15.75 X15.76667 X15.78333 X15.8 X15.81667 X15.83333 X15.85 X15.86667 X15.88333 X15.9 X15.91667 X15.93333 X15.95 X15.96667 X15.98333 X16 X16.01667 X16.03333 X16.05 X16.06667 X16.08333 X16.1 X16.11667 X16.13333 X16.15 X16.16667 X16.18333 X16.2 X16.21667 X16.23333 X16.25 X16.26667 X16.28333 X16.3 X16.31667 X16.33333 X16.35 X16.36667 X16.38333 X16.4 X16.41667 X16.43333 X16.45 X16.46667 X16.48333 X16.5 X16.51667 X16.53333 X16.55 X16.56667 X16.58333 X16.6 X16.61667 X16.63333 X16.65 X16.66667 X16.68333 X16.7 X16.71667 X16.73333 X16.75 X16.76667 X16.78333 X16.8 X16.81667 X16.83333 X16.85 X16.86667 X16.88333 X16.9 X16.91667 X16.93333 X16.95 X16.96667 X16.98333 X17 X17.01667 X17.03333 X17.05 X17.06667 X17.08333 X17.1 X17.11667 X17.13333 X17.15 X17.16667 X17.18333 X17.2 X17.21667 X17.23333 X17.25 X17.26667 X17.28333 X17.3 X17.31667 X17.33333 X17.35 X17.36667 X17.38333 X17.4 X17.41667 X17.43333 X17.45 X17.46667 X17.48333 X17.5 X17.51667 X17.53333 X17.55 X17.56667 X17.58333 X17.6 X17.61667 X17.63333 X17.65 X17.66667 X17.68333 X17.7 X17.71667 X17.73333 X17.75 X17.76667 X17.78333 X17.8 X17.81667 X17.83333 X17.85 X17.86667 X17.88333 X17.9 X17.91667 X17.93333 X17.95 X17.96667 X17.98333 X18 X18.01667 X18.03333 X18.05 X18.06667 X18.08333 X18.1 X18.11667 X18.13333 X18.15 X18.16667 X18.18333 X18.2 X18.21667 X18.23333 X18.25 X18.26667 X18.28333 X18.3 X18.31667 X18.33333 X18.35 X18.36667 X18.38333 X18.4 X18.41667 X18.43333 X18.45 X18.46667 X18.48333 X18.5 X18.51667 X18.53333 X18.55 X18.56667 X18.58333 X18.6 X18.61667 X18.63333 X18.65 X18.66667 X18.68333 X18.7 X18.71667 X18.73333 X18.75 X18.76667 X18.78333 X18.8 X18.81667 X18.83333 X18.85 X18.86667 X18.88333 X18.9 X18.91667 X18.93333 X18.95 X18.96667 X18.98333 X19 X19.01667 X19.03333 X19.05 X19.06667 X19.08333 X19.1 X19.11667 X19.13333 X19.15 X19.16667 X19.18333 X19.2 X19.21667 X19.23333 X19.25 X19.26667 X19.28333 X19.3 X19.31667 X19.33333 X19.35 X19.36667 
X19.38333 X19.4 X19.41667 X19.43333 X19.45 X19.46667 X19.48333 X19.5 X19.51667 X19.53333 X19.55 X19.56667 X19.58333 X19.6 X19.61667 X19.63333 X19.65 X19.66667 X19.68333 X19.7 X19.71667 X19.73333 X19.75 X19.76667 X19.78333 X19.8 X19.81667 X19.83333 X19.85 X19.86667 X19.88333 X19.9 X19.91667 X19.93333 X19.95 X19.96667 X19.98333 X20 X20.01667 X20.03333 X20.05 X20.06667 X20.08333 X20.1 X20.11667 X20.13333 X20.15 X20.16667 X20.18333 X20.2 X20.21667 X20.23333 X20.25 X20.26667 X20.28333 X20.3 X20.31667 X20.33333 X20.35 X20.36667 X20.38333 X20.4 X20.41667 X20.43333 X20.45 X20.46667 X20.48333 X20.5 X20.51667 X20.53333 X20.55 X20.56667 X20.58333 X20.6 X20.61667 X20.63333 X20.65 X20.66667 X20.68333 X20.7 X20.71667 X20.73333 X20.75 X20.76667 X20.78333 X20.8 X20.81667 X20.83333 X20.85 X20.86667 X20.88333 X20.9 X20.91667 X20.93333 X20.95 X20.96667 X20.98333 X21 X21.01667 X21.03333 X21.05 X21.06667 X21.08333 X21.1 X21.11667 X21.13333 X21.15 X21.16667 X21.18333 X21.2 X21.21667 X21.23333 X21.25 X21.26667 X21.28333 X21.3 X21.31667 X21.33333 X21.35 X21.36667 X21.38333 X21.4 X21.41667 X21.43333 X21.45 X21.46667 X21.48333 X21.5 X21.51667 X21.53333 X21.55 X21.56667 X21.58333 X21.6 X21.61667 X21.63333 X21.65 X21.66667 X21.68333 X21.7 X21.71667 X21.73333 X21.75 X21.76667 X21.78333 X21.8 X21.81667 X21.83333 X21.85 X21.86667 X21.88333 X21.9 X21.91667 X21.93333 X21.95 X21.96667 X21.98333 X22 X22.01667 X22.03333 X22.05 X22.06667 X22.08333 X22.1 X22.11667 X22.13333 X22.15 X22.16667 X22.18333 X22.2 X22.21667 X22.23333 X22.25 X22.26667 X22.28333 X22.3 X22.31667 X22.33333 X22.35 X22.36667 X22.38333 X22.4 X22.41667 X22.43333 X22.45 X22.46667 X22.48333 X22.5 X22.51667 X22.53333 X22.55 X22.56667 X22.58333 X22.6 X22.61667 X22.63333 X22.65 X22.66667 X22.68333 X22.7 X22.71667 X22.73333 X22.75 X22.76667 X22.78333 X22.8 X22.81667 X22.83333 X22.85 X22.86667 X22.88333 X22.9 X22.91667 X22.93333 X22.95 X22.96667 X22.98333 X23 X23.01667 X23.03333 X23.05 X23.06667 X23.08333 X23.1 X23.11667 X23.13333 X23.15 
X23.16667 X23.18333 X23.2 X23.21667 X23.23333 X23.25 X23.26667 X23.28333 X23.3 X23.31667 X23.33333 X23.35 X23.36667 X23.38333 X23.4 X23.41667 X23.43333 X23.45 X23.46667 X23.48333 X23.5 X23.51667 X23.53333 X23.55 X23.56667 X23.58333 X23.6 X23.61667 X23.63333 X23.65 X23.66667 X23.68333 X23.7 X23.71667 X23.73333 X23.75 X23.76667 X23.78333 X23.8 X23.81667 X23.83333 X23.85 X23.86667 X23.88333 X23.9 X23.91667 X23.93333 X23.95 X23.96667 X23.98333 X24 X24.01667 X24.03333 X24.05 X24.06667 X24.08333 X24.1 X24.11667 X24.13333 X24.15 X24.16667 X24.18333 X24.2 X24.21667 X24.23333 X24.25 X24.26667 X24.28333 X24.3 X24.31667 X24.33333 X24.35 X24.36667 X24.38333 X24.4 X24.41667 X24.43333 X24.45 X24.46667 X24.48333 X24.5 X24.51667 X24.53333 X24.55 X24.56667 X24.58333 X24.6 X24.61667 X24.63333 X24.65 X24.66667 X24.68333 X24.7 X24.71667 X24.73333 X24.75 X24.76667 X24.78333 X24.8 X24.81667 X24.83333 X24.85 X24.86667 X24.88333 X24.9 X24.91667 X24.93333 X24.95 X24.96667 X24.98333 X25 X25.01667 X25.03333 X25.05 X25.06667 X25.08333 X25.1 X25.11667 X25.13333 X25.15 X25.16667 X25.18333 X25.2 X25.21667 X25.23333 X25.25 X25.26667 X25.28333 X25.3 X25.31667 X25.33333 X25.35 X25.36667 X25.38333 X25.4 X25.41667 X25.43333 X25.45 X25.46667 X25.48333 X25.5 X25.51667 X25.53333 X25.55 X25.56667 X25.58333 X25.6 X25.61667 X25.63333 X25.65 X25.66667 X25.68333 X25.7 X25.71667 X25.73333 X25.75 X25.76667 X25.78333 X25.8 X25.81667 X25.83333 X25.85 X25.86667 X25.88333 X25.9 X25.91667 X25.93333 X25.95 X25.96667 X25.98333 X26 X26.01667 X26.03333 X26.05 X26.06667 X26.08333 X26.1 X26.11667 X26.13333 X26.15 X26.16667 X26.18333 X26.2 X26.21667 X26.23333 X26.25 X26.26667 X26.28333 X26.3 X26.31667 X26.33333 X26.35 X26.36667 X26.38333 X26.4 X26.41667 X26.43333 X26.45 X26.46667 X26.48333 X26.5 X26.51667 X26.53333 X26.55 X26.56667 X26.58333 X26.6 X26.61667 X26.63333 X26.65 X26.66667 X26.68333 X26.7 X26.71667 X26.73333 X26.75 X26.76667 X26.78333 X26.8 X26.81667 X26.83333 X26.85 X26.86667 X26.88333 X26.9 X26.91667 
X26.93333 X26.95 X26.96667 X26.98333 X27 X27.01667 X27.03333 X27.05 X27.06667 X27.08333 X27.1 X27.11667 X27.13333 X27.15 X27.16667 X27.18333 X27.2 X27.21667 X27.23333 X27.25 X27.26667 X27.28333 X27.3 X27.31667 X27.33333 X27.35 X27.36667 X27.38333 X27.4 X27.41667 X27.43333 X27.45 X27.46667 X27.48333 X27.5 X27.51667 X27.53333 X27.55 X27.56667 X27.58333 X27.6 X27.61667 X27.63333 X27.65 X27.66667 X27.68333 X27.7 X27.71667 X27.73333 X27.75 X27.76667 X27.78333 X27.8 X27.81667 X27.83333 X27.85 X27.86667 X27.88333 X27.9 X27.91667 X27.93333 X27.95 X27.96667 X27.98333 X28 X28.01667 X28.03333 X28.05 X28.06667 X28.08333 X28.1 X28.11667 X28.13333 X28.15 X28.16667 X28.18333 X28.2 X28.21667 X28.23333 X28.25 X28.26667 X28.28333 X28.3 X28.31667 X28.33333 X28.35 X28.36667 X28.38333 X28.4 X28.41667 X28.43333 X28.45 X28.46667 X28.48333 X28.5 X28.51667 X28.53333 X28.55 X28.56667 X28.58333 X28.6 X28.61667 X28.63333 X28.65 X28.66667 X28.68333 X28.7 X28.71667 X28.73333 X28.75 X28.76667 X28.78333 X28.8 X28.81667 X28.83333 X28.85 X28.86667 X28.88333 X28.9 X28.91667 X28.93333 X28.95 X28.96667 X28.98333 X29 X29.01667 X29.03333 X29.05 X29.06667 X29.08333 X29.1 X29.11667 X29.13333 X29.15 X29.16667 X29.18333 X29.2 X29.21667 X29.23333 X29.25 X29.26667 X29.28333 X29.3 X29.31667 X29.33333 X29.35 X29.36667 X29.38333 X29.4 X29.41667 X29.43333 X29.45 X29.46667 X29.48333 X29.5 X29.51667 X29.53333 X29.55 X29.56667 X29.58333 X29.6 X29.61667 X29.63333 X29.65 X29.66667 X29.68333 X29.7 X29.71667 X29.73333 X29.75 X29.76667 X29.78333 X29.8 X29.81667 X29.83333 X29.85 X29.86667 X29.88333 X29.9 X29.91667 X29.93333 X29.95 X29.96667 X29.98333 X30 X30.01667 X30.03333 X30.05 X30.06667 X30.08333 X30.1 X30.11667 X30.13333 X30.15 X30.16667 X30.18333 X30.2 X30.21667 X30.23333 X30.25 X30.26667 X30.28333 X30.3 X30.31667 X30.33333 X30.35 X30.36667 X30.38333 X30.4 X30.41667 X30.43333 X30.45 X30.46667 X30.48333 X30.5 X30.51667 X30.53333 X30.55 X30.56667 X30.58333 X30.6 X30.61667 X30.63333 X30.65 X30.66667 X30.68333 X30.7 
X30.71667 X30.73333 X30.75 X30.76667 X30.78333 X30.8 X30.81667 X30.83333 X30.85 X30.86667 X30.88333 X30.9 X30.91667 X30.93333 X30.95 X30.96667 X30.98333 X31 X31.01667 X31.03333 X31.05 X31.06667 X31.08333 X31.1 X31.11667 X31.13333 X31.15 X31.16667 X31.18333 X31.2 X31.21667 X31.23333 X31.25 X31.26667 X31.28333 X31.3 X31.31667 X31.33333 X31.35 X31.36667 X31.38333 X31.4 X31.41667 X31.43333 X31.45 X31.46667 X31.48333 X31.5 X31.51667 X31.53333 X31.55 X31.56667 X31.58333 X31.6 X31.61667 X31.63333 X31.65 X31.66667 X31.68333 X31.7 X31.71667 X31.73333 X31.75 X31.76667 X31.78333 X31.8 X31.81667 X31.83333 X31.85 X31.86667 X31.88333 X31.9 X31.91667 X31.93333 X31.95 X31.96667 X31.98333 X32 X32.01667 X32.03333 X32.05 X32.06667 X32.08333 X32.1 X32.11667 X32.13333 X32.15 X32.16667 X32.18333 X32.2 X32.21667 X32.23333 X32.25 X32.26667 X32.28333 X32.3 X32.31667 X32.33333 X32.35 X32.36667 X32.38333 X32.4 X32.41667 X32.43333 X32.45 X32.46667 X32.48333 X32.5 X32.51667 X32.53333 X32.55 X32.56667 X32.58333 X32.6 X32.61667 X32.63333 X32.65 X32.66667 X32.68333 X32.7 X32.71667 X32.73333 X32.75 X32.76667 X32.78333 X32.8 X32.81667 X32.83333 X32.85 X32.86667 X32.88333 X32.9 X32.91667 X32.93333 X32.95 X32.96667 X32.98333 X33 X33.01667 X33.03333 X33.05 X33.06667 X33.08333 X33.1 X33.11667 X33.13333 X33.15 X33.16667 X33.18333 X33.2 X33.21667 X33.23333 X33.25 X33.26667 X33.28333 X33.3 X33.31667 X33.33333 X33.35 X33.36667 X33.38333 X33.4 X33.41667 X33.43333 X33.45 X33.46667 X33.48333 X33.5 X33.51667 X33.53333 X33.55 X33.56667 X33.58333 X33.6 X33.61667 X33.63333 X33.65 X33.66667 X33.68333 X33.7 X33.71667 X33.73333 X33.75 X33.76667 X33.78333 X33.8 X33.81667 X33.83333 X33.85 X33.86667 X33.88333 X33.9 X33.91667 X33.93333 X33.95 X33.96667 X33.98333 X34 X34.01667 X34.03333 X34.05 X34.06667 X34.08333 X34.1 X34.11667 X34.13333 X34.15 X34.16667 X34.18333 X34.2 X34.21667 X34.23333 X34.25 X34.26667 X34.28333 X34.3 X34.31667 X34.33333 X34.35 X34.36667 X34.38333 X34.4 X34.41667 X34.43333 X34.45 X34.46667 X34.48333 
X34.5 X34.51667 X34.53333 X34.55 X34.56667 X34.58333 X34.6 X34.61667 X34.63333 X34.65 X34.66667 X34.68333 X34.7 X34.71667 X34.73333 X34.75 X34.76667 X34.78333 X34.8 X34.81667 X34.83333 X34.85 X34.86667 X34.88333 X34.9 X34.91667 X34.93333 X34.95 X34.96667 X34.98333 X35 X35.01667 Wavelength_200 X201 X202 X203 X204 X205 X206 X207 X208 X209 X210 X211 X212 X213 X214 X215 X216 X217 X218 X219 X220 X221 X222 X223 X224 X225 X226 X227 X228 X229 X230 X231 X232 X233 X234 X235 X236 X237 X238 X239 X240 X241 X242 X243 X244 X245 X246 X247 X248 X249 X250 X251 X252 X253 X254 X255 X256 X257 X258 X259 X260 X261 X262 X263 X264 X265 X266 X267 X268 X269 X270 X271 X272 X273 X274 X275 X276 X277 X278 X279 X280 X281 X282 X283 X284 X285 X286 X287 X288 X289 X290 X291 X292 X293 X294 X295 X296 X297 X298 X299 X300 X301 X302 X303 X304 X305 X306 X307 X308 X309 X310 X311 X312 X313 X314 X315 X316 X317 X318 X319 X320 X321 X322 X323 X324 X325 X326 X327 X328 X329 X330 X331 X332 X333 X334 X335 X336 X337 X338 X339 X340 X341 X342 X343 X344 X345 X346 X347 X348 X349 X350 X351 X352 X353 X354 X355 X356 X357 X358 X359 X360 X361 X362 X363 X364 X365 X366 X367 X368 X369 X370 X371 X372 X373 X374 X375 X376 X377 X378 X379 X380 X381 X382 X383 X384 X385 X386 X387 X388 X389 X390 X391 X392 X393 X394 X395 X396 X397 X398 X399 X400 X401 X402 X403 X404 X405 X406 X407 X408 X409 X410 X411 X412 X413 X414 X415 X416 X417 X418 X419 X420 X421 X422 X423 X424 X425 X426 X427 X428 X429 X430 X431 X432 X433 X434 X435 X436 X437 X438 X439 X440 X441 X442 X443 X444 X445 X446 X447 X448 X449 X450 X451 X452 X453 X454 X455 X456 X457 X458 X459 X460 X461 X462 X463 X464 X465 X466 X467 X468 X469 X470 X471 X472 X473 X474 X475 X476 X477 X478 X479 X480 X481 X482 X483 X484 X485 X486 X487 X488 X489 X490 X491 X492 X493 X494 X495 X496 X497 X498 X499 X500 X501 X502 X503 X504 X505 X506 X507 X508 X509 X510 X511 X512 X513 X514 X515 X516 X517 X518 X519 X520 X521 X522 X523 X524 X525 X526 X527 X528 X529 X530 X531 X532 X533 X534 X535 X536 X537 X538 X539 X540 X541 
X542 X543 X544 X545 X546 X547 X548 X549 X550 X551 X552 X553 X554 X555 X556 X557 X558 X559 X560 X561 X562 X563 X564 X565 X566 X567 X568 X569 X570 X571 X572 X573 X574 X575 X576 X577 X578 X579 X580 X581 X582 X583 X584 X585 X586 X587 X588 X589 X590 X591 X592 X593 X594 X595 X596 X597 X598 X599 X600 X601 X602 X603 X604 X605 X606 X607 X608 X609 X610 X611 X612 X613 X614 X615 X616 X617 X618 X619 X620 X621 X622 X623 X624 X625 X626 X627 X628 X629 X630 X631 X632 X633 X634 X635 X636 X637 X638 X639 X640 X641 X642 X643 X644 X645 X646 X647 X648 X649 X650 X651 X652 X653 X654 X655 X656 X657 X658 X659 X660 X661 X662 X663 X664 X665 X666 X667 X668 X669 X670 X671 X672 X673 X674 X675 X676 X677 X678 X679 X680 X681 X682 X683 X684 X685 X686 X687 X688 X689 X690 X691 X692 X693 X694 X695 X696 X697 X698 X699 X700 X701 X702 X703 X704 X705 X706 X707 X708 X709 X710 X711 X712 X713 X714 X715 X716 X717 X718 X719 X720 X721 X722 X723 X724 X725 X726 X727 X728 X729 X730 X731 X732 X733 X734 X735 X736 X737 X738 X739 X740 X741 X742 X743 X744 X745 X746 X747 X748 X749 X750 X751 X752 X753 X754 X755 X756 X757 X758 X759 X760 X761 X762 X763 X764 X765 X766 X767 X768 X769 X770 X771 X772 X773 X774 X775 X776 X777 X778 X779 X780 X781 X782 X783 X784 X785 X786 X787 X788 X789 X790 X791 X792 X793 X794 X795 X796 X797 X798 X799 X800 X801 X802 X803 X804 X805 X806 X807 X808 X809 X810 X811 X812 X813 X814 X815 X816 X817 X818 X819 X820 X821 X822 X823 X824 X825 X826 X827 X828 X829 X830 X831 X832 X833 X834 X835 X836 X837 X838 X839 X840 X841 X842 X843 X844 X845 X846 X847 X848 X849 X850 X851 X852 X853 X854 X855 X856 X857 X858 X859 X860 X861 X862 X863 X864 X865 X866 X867 X868 X869 X870 X871 X872 X873 X874 X875 X876 X877 X878 X879 X880 X881 X882 X883 X884 X885 X886 X887 X888 X889 X890 X891 X892 X893 X894 X895 X896 X897 X898 X899 X900 X1.10 X2.10 X3.10 X4.10 X5.10 X6.10 X7.10 X8.10 X9.10 X10.10 X11.10 X12.10 X13.10
## 
## First 5 rows and 5 columns of raw data:
Head of Raw Data
Unique.Id  Sample.type  X0         X0.1       X0.117
PC1        Cancer       -3.63e-03  -0.002010  -1.52e-03
PC2        Cancer       -7.33e-05  -0.000210  -3.70e-04
PC3        Cancer       -1.47e-05   0.000195   1.75e-04
PC4        Cancer       -8.60e-04   0.000411   7.00e-04
PC5        Cancer        1.12e-04   0.000057   1.93e-05

Initial Data Preparation

This section covers the initial processing of the raw data to prepare it for feature engineering and modeling. This includes extracting sample IDs, class labels, and features, converting features to numeric, and handling any samples with missing class labels. This prepared dataset (df_initial_features) will be the input for feature selection methods.
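The preparation steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the report's actual code: the toy table raw_demo, the object df_demo, and the choice to list "Cancer" as the first factor level are all hypothetical.

```r
# Toy stand-in for the raw table: ID column, label column, then feature columns.
raw_demo <- data.frame(
  Unique.Id   = c("PC1", "PC2", "N1", "N2"),
  Sample.type = c("Cancer", "Cancer", "Normal", NA),
  X0          = c("-0.00363", "-7.33e-05", "1.1e-04", "2.0e-04"),
  X0.1        = c("-0.00201", "-0.00021", "bad", "5.7e-05"),
  stringsAsFactors = FALSE
)

sample_ids  <- raw_demo[[1]]
class_label <- raw_demo[[2]]
features    <- raw_demo[, -(1:2), drop = FALSE]

# Coerce every feature column to numeric; unparseable entries become NA
features[] <- lapply(features, function(col) suppressWarnings(as.numeric(col)))

# Drop samples with a missing class label, then build the modelling frame
keep <- !is.na(class_label)
df_demo <- data.frame(
  features[keep, , drop = FALSE],
  Class = factor(class_label[keep], levels = c("Cancer", "Normal"))
)
rownames(df_demo) <- sample_ids[keep]

print(dim(df_demo))
print(table(df_demo$Class))
```

Entries that cannot be parsed as numbers are left as NA here, so that a later imputation step (the helper below uses median imputation) can handle them.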

## Number of features identified initially:  2811
## Dimensions of initial feature data (df_initial_features) after NA Class removal:  75  samples,  2812  columns (including Class).
## Class distribution in df_initial_features:
Class Distribution in Initial Data (for Feature Selection)
Var1 Freq
Normal 48
Cancer 27

Feature Selection Methods

We explore two feature selection methods: Near-Zero Variance (NZV) filtering and Boruta.

Feature Reduction: Near-Zero Variance (NZV)

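NZV filtering removes features that are constant or nearly constant across samples, since such features carry little information for classification. In this dataset no feature met the criterion, so the NZV-filtered set is identical to the initial feature set.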

## Number of near-zero variance features removed:  0
## Number of features remaining after NZV filtering:  2811
## Dimensions of NZV-filtered feature data (df_nzv_features):  75  samples,  2812  columns (including Class).
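To illustrate what a near-zero variance check looks for, the sketch below flags a column when its most frequent value dominates and it has few distinct values. The two criteria and their default cut-offs (95/5 and 10) are modelled on caret::nearZeroVar; the helper is_near_zero_var and the demo data are hypothetical and written in base R for illustration only.

```r
# Flag a column as near-zero variance using two criteria modelled on
# caret::nearZeroVar: a dominant most-frequent value and few distinct values.
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) <= 1) return(TRUE)            # constant column
  freq_ratio     <- as.numeric(counts[1] / counts[2])
  percent_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && percent_unique < unique_cut
}

set.seed(1)
demo <- data.frame(
  continuous = rnorm(75),                 # many distinct values: kept
  constant   = rep(0, 75),                # zero variance: flagged
  nearly     = c(rep(0, 73), 1, 2)        # 73 zeros against a single 1: flagged
)
print(sapply(demo, is_near_zero_var))
```

Continuous measurements rarely repeat exact values, which is one plausible reason a filter of this kind removes nothing from a dataset like this one.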

Feature Selection: Boruta

Boruta is a wrapper algorithm built around Random Forest that iteratively compares the importance of original features with that of random shadow features.

Note: Boruta can be computationally intensive, especially with a large number of features or samples. The maxRuns parameter controls the maximum number of Random Forest runs.

## Starting Boruta feature selection... This may take some time.
## Boruta feature selection complete.
## Number of features selected by Boruta (Confirmed only):  7
## Dimensions of Boruta-selected feature data (df_boruta_features):  75  samples,  8  columns (including Class).
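A Boruta run of the kind described above might look like the following sketch. It uses synthetic data rather than the report's dataset, assumes the Boruta package is installed (the block is skipped otherwise), and the object names x, y, boruta_fit, and confirmed are illustrative.

```r
# Sketch only: requires the Boruta package; skipped if it is not installed.
if (requireNamespace("Boruta", quietly = TRUE)) {
  set.seed(42)
  n <- 75
  # Synthetic data: five pure-noise features plus one informative feature
  y <- factor(rep(c("Cancer", "Normal"), times = c(27, 48)))
  x <- as.data.frame(matrix(rnorm(n * 5), nrow = n))
  x$signal <- rnorm(n, mean = ifelse(y == "Cancer", 2, 0))

  # maxRuns caps the number of importance-source (Random Forest) runs
  boruta_fit <- Boruta::Boruta(x = x, y = y, maxRuns = 100, doTrace = 0)

  # Keep only features whose final decision is "Confirmed"
  confirmed <- Boruta::getSelectedAttributes(boruta_fit, withTentative = FALSE)
  print(confirmed)
}
```

Setting withTentative = FALSE matches the "Confirmed only" count reported above; features still marked Tentative when maxRuns is reached are left out.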

Model Training Framework

Helper Functions

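library(caret) # provides createDataPartition(), preProcess(), train(), and trainControl()

# Helper function: splits off a hold-out set, preprocesses the CV training data, and trains a tuned Random Forest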
train_rf_model <- function(data_to_train, model_name_suffix, train_control_config, tune_grid_config) {
    set.seed(123) 
    train_idx_helper <- createDataPartition(data_to_train$Class, p = .80, list = FALSE, times = 1)
    cv_train_data_helper <- data_to_train[train_idx_helper, ]
    
    original_full_data_rownames <- rownames(data_to_train)
    holdout_indices_from_original <- which(!original_full_data_rownames %in% rownames(cv_train_data_helper))

    cat(paste("\n--- Training RF Model:", model_name_suffix, "---\n"))
    cat("Dimensions of CV training data for", model_name_suffix, ":", dim(cv_train_data_helper)[1],"x",dim(cv_train_data_helper)[2],"\n")
    
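    # Select feature columns by name; negative indexing with which() would drop every column if "Class" were missing
    features_for_preproc_helper <- cv_train_data_helper[, setdiff(names(cv_train_data_helper), "Class"), drop = FALSE]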
    # Ensure there are features to preprocess
    if(ncol(features_for_preproc_helper) == 0) {
        cat("Warning: No features to preprocess for model", model_name_suffix, ". Model training might fail or be trivial.\n")
        # Create a dummy processed_cv_train_data if no features, to allow train to proceed (it will likely be a majority class classifier)
        processed_cv_train_data_helper <- cv_train_data_helper[, "Class", drop=FALSE]
        current_preProcValues_helper <- NULL # No preprocessor
    } else {
        current_preProcValues_helper <- preProcess(features_for_preproc_helper, method = c("center", "scale", "medianImpute"))
        processed_cv_train_features_helper <- predict(current_preProcValues_helper, features_for_preproc_helper)
        processed_cv_train_data_helper <- cbind(processed_cv_train_features_helper, Class = cv_train_data_helper$Class)
    }

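    minority_size_helper <- min(table(processed_cv_train_data_helper$Class))
    if ("sampling" %in% names(train_control_config) && !is.null(train_control_config$sampling) && minority_size_helper < 10 && ncol(features_for_preproc_helper) > 0) {
        # SMOTE interpolates between minority-class neighbours, so a very small minority class gives unreliable synthetic samples
        warning("Minority class has only ", minority_size_helper, " samples in the CV training data for ", model_name_suffix, "; SMOTE may be unreliable.")
    }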

    set.seed(456)
    if(ncol(processed_cv_train_data_helper) <= 1) { # Only the Class column (or nothing) remains, so there is nothing to train on
        cat("Skipping model training for", model_name_suffix, "due to no predictive features after preprocessing.\n")
        return(list(model = NULL, preprocessor = current_preProcValues_helper, holdout_indices_original = holdout_indices_from_original))
    }

    model_fit <- tryCatch({
        train(
            Class ~ .,
            data = processed_cv_train_data_helper,
            method = "rf",
            trControl = train_control_config,
            metric = "ROC",
            tuneGrid = tune_grid_config,
            importance = TRUE,
            na.action = na.omit
        )
    }, error = function(e) {
        cat("Error during training for", model_name_suffix, ":", e$message, "\n")
        return(NULL) # Return NULL if training fails
    })
    
    if(!is.null(model_fit)){
        cat("Training complete for RF Model:", model_name_suffix, "\n")
    }
    
    return(list(model = model_fit, 
                preprocessor = current_preProcValues_helper, 
                holdout_indices_original = holdout_indices_from_original))
}

create_rf_tune_grid <- function(data_frame_for_features) {
    # Check if there are any feature columns other than 'Class'
    feature_cols <- setdiff(colnames(data_frame_for_features), "Class")
    if(length(feature_cols) == 0) {
        # No features to tune mtry for, return a default grid that RF can handle (e.g. mtry=1 if Class is the only col, though RF will fail)
        # Or, more robustly, this case should be handled before calling train (e.g. skip training)
        # For now, let's assume if this function is called, there's at least one feature.
        # If features_for_preproc_helper is empty, this will lead to num_feats = 0
        warning("create_rf_tune_grid called with no feature columns. mtry grid will be minimal.")
        return(expand.grid(mtry = 1)) # Default mtry for RF if p=0 is problematic
    }
    num_feats <- length(feature_cols)
    default_m <- floor(sqrt(num_feats))
    grid_vals <- unique(c(
        max(1, floor(default_m / 2)), # ensure mtry is at least 1
        max(1, default_m),
        min(num_feats, max(1, default_m * 2)), 
        min(num_feats, 10), 
        min(num_feats, 20)
    ))
    grid_vals <- sort(unique(grid_vals[grid_vals <= num_feats & grid_vals > 0]))
    if(length(grid_vals) == 0) { grid_vals <- c(max(1, num_feats)) }
    return(expand.grid(mtry = grid_vals))
}
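The grid rules in create_rf_tune_grid can be checked with a short worked example. The compact helper mtry_grid below is hypothetical and only mirrors the arithmetic: half, one times, and two times floor(sqrt(p)), plus the fixed values 10 and 20, each capped at the number of features p.

```r
# Hypothetical compact helper mirroring the grid rules in create_rf_tune_grid
mtry_grid <- function(p) {
  m <- floor(sqrt(p))                       # common RF default for classification
  candidates <- c(max(1, floor(m / 2)), max(1, m), min(p, max(1, 2 * m)),
                  min(p, 10), min(p, 20))
  sort(unique(candidates))
}

print(mtry_grid(2811))  # all initial features: 10 20 26 53 106
print(mtry_grid(7))     # Boruta-selected features: 1 2 4 7
```

For 2811 features, floor(sqrt(2811)) is 53, which gives 26, 53, and 106 alongside the fixed values 10 and 20. For 7 features the fixed values are capped at 7 and collapse into a single entry.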

Base trainControl Configuration

base_train_control <- trainControl(
    method = "cv", number = 10, summaryFunction = twoClassSummary,
    classProbs = TRUE, verboseIter = FALSE, allowParallel = TRUE 
)

Model Training Scenarios and Comparative Analysis

Four Random Forest model scenarios are trained and compared:
1. All Features + SMOTE: uses all initial features with SMOTE.
2. NZV Features + SMOTE: uses NZV-filtered features with SMOTE.
3. Boruta Features + SMOTE (Main Model): uses Boruta-selected features with SMOTE. This is treated as the primary model.
4. Boruta Features + No SMOTE: uses Boruta-selected features without SMOTE, to isolate the effect of SMOTE on the Boruta feature set.
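For readers unfamiliar with SMOTE, the sketch below shows the core idea in base R: each synthetic sample is placed at a random point on the line between a minority-class sample and one of its nearest minority-class neighbours. This is an illustration, not the implementation used in the scenarios; the function smote_sketch, its parameters, and the sample counts are hypothetical.

```r
# Minimal SMOTE-style oversampling for numeric features (illustrative only).
# x_min: matrix of minority-class samples; n_new: synthetic samples to create;
# k: number of nearest minority neighbours to interpolate towards.
smote_sketch <- function(x_min, n_new, k = 5) {
  x_min <- as.matrix(x_min)
  stopifnot(nrow(x_min) >= 2)
  k <- min(k, nrow(x_min) - 1)
  d <- as.matrix(dist(x_min))               # pairwise Euclidean distances
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(x_min))
  for (i in seq_len(n_new)) {
    base <- sample(nrow(x_min), 1)
    neighbours <- order(d[base, ])[2:(k + 1)]   # position 1 is the point itself
    nb <- neighbours[sample(length(neighbours), 1)]
    gap <- runif(1)                             # random position along the segment
    synthetic[i, ] <- x_min[base, ] + gap * (x_min[nb, ] - x_min[base, ])
  }
  synthetic
}

set.seed(7)
minority <- matrix(rnorm(22 * 3), ncol = 3)     # arbitrary counts for illustration
new_points <- smote_sketch(minority, n_new = 17)
print(dim(new_points))
```

Because every synthetic point is interpolated between two real minority samples, it never leaves the range of the observed minority data. This is also why SMOTE adds little new information when the minority class is very small.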

Scenario 1: All Features + SMOTE
Tune grid (mtry values): 10, 20, 26, 53, 106

--- Training RF Model: AllFeatures_SMOTE ---
Dimensions of CV training data for AllFeatures_SMOTE : 61 x 2812
Training complete for RF Model: AllFeatures_SMOTE
Best mtry: 20
CV AUCROC (best tune): 0.8062

Scenario 2: NZV Features + SMOTE
Tune grid (mtry values): 10, 20, 26, 53, 106

--- Training RF Model: NZVFeatures_SMOTE ---
Dimensions of CV training data for NZVFeatures_SMOTE : 61 x 2812
Training complete for RF Model: NZVFeatures_SMOTE
Best mtry: 20
CV AUCROC (best tune): 0.8062

Scenario 3: Boruta Features + SMOTE (Main Model)
Tune grid (mtry values): 1, 2, 4, 7

--- Training RF Model: BorutaFeatures_SMOTE (Main) ---
Dimensions of CV training data for BorutaFeatures_SMOTE (Main) : 61 x 8
Training complete for RF Model: BorutaFeatures_SMOTE (Main)
Best mtry: 4
CV AUCROC (best tune): 0.7375

Scenario 4: Boruta Features + NO SMOTE
Tune grid (mtry values): 1, 2, 4, 7

--- Training RF Model: BorutaFeatures_NoSMOTE ---
Dimensions of CV training data for BorutaFeatures_NoSMOTE : 61 x 8
Training complete for RF Model: BorutaFeatures_NoSMOTE
Best mtry: 1
CV AUCROC (best tune): 0.7458

Visualization of Key Step Impacts

Hyperparameter Tuning Profile (Main Boruta Model)

Hyperparameter Tuning Profile for the Main Model (Boruta Features + SMOTE).


Impact of Feature Selection Method

This plot compares the best cross-validated AUCROC from models trained with All Features, NZV-filtered Features, and Boruta-selected Features (all with SMOTE and hyperparameter tuning).

Impact of Feature Selection Method on Cross-Validated AUCROC.


Impact of SMOTE (on Boruta-Selected Features)

Impact of SMOTE on Performance (Boruta Features, Best Tune).


Main Model (Boruta Features + SMOTE): CV Performance Distribution

CV Performance Distribution (Main Boruta Model).


Main Model (Boruta Features + SMOTE): Evaluation on Hold-out Test Set

The main model (Boruta Features + SMOTE, best mtry) is evaluated on a separate hold-out test set.

## **Confusion Matrix (Hold-out - Main Boruta Model):**
## 
## **Overall Statistics (Hold-out - Main Boruta Model):**
## 
## **Class Statistics (Hold-out - Main Boruta Model):**
## 
## AUC (Hold-out - Main Boruta Model): 0.9333
ROC Curve for Main Boruta Model on Hold-out Test Set.
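As a reference for how a hold-out AUC is obtained from predicted probabilities, the sketch below uses the rank-sum (Mann-Whitney) identity in base R. It is not the report's evaluation code: the function auc_rank and the labels and scores are made up for illustration.

```r
# AUC from predicted probabilities via the rank-sum (Mann-Whitney) identity.
# scores: predicted probability of the positive class; labels: true classes.
auc_rank <- function(scores, labels, positive = "Cancer") {
  is_pos <- labels == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(scores)                         # average ranks handle ties
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up hold-out predictions for illustration (not the report's data)
labels <- c("Cancer", "Cancer", "Cancer", "Normal", "Normal", "Normal", "Normal")
scores <- c(0.90, 0.70, 0.40, 0.60, 0.30, 0.20, 0.10)
print(auc_rank(scores, labels))  # 11 of 12 positive/negative pairs ranked correctly: 0.9167
```

With a small hold-out set the AUC moves in coarse steps: in this toy example each positive/negative pair accounts for 1/12 of the score, so a single misranked pair changes the AUC by about 0.083.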


Main Model (Boruta Features + SMOTE): Variable Importance

The plot below shows the top features identified by the main model (Boruta Features + SMOTE, best mtry).

Top Important Features (Main Boruta Model).


Conclusion

This report presented a comprehensive machine learning pipeline, including data preprocessing, feature selection (NZV and Boruta), hyperparameter tuning for a Random Forest model, class imbalance handling (SMOTE), and thorough evaluation using 10-fold cross-validation and a hold-out test set. The main model utilized features selected by Boruta, along with SMOTE and hyperparameter tuning.

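Key findings include:
- Hyperparameter tuning: the best mtry was 20 for the All Features and NZV models, 4 for the main Boruta model, and 1 for the Boruta model without SMOTE.
- Feature selection: NZV filtering removed no features, so the All Features and NZV models use the same 2811 features and reach the same CV AUCROC (0.8062). Boruta reduced the set to 7 confirmed features, with a CV AUCROC of 0.7375 for the main model.
- SMOTE: on the Boruta-selected feature set, the CV AUCROC was 0.7375 with SMOTE and 0.7458 without it; the corresponding plot also compares Sensitivity and Specificity.
- Stability: the CV performance distribution plot shows the main Boruta model’s performance across the 10 CV folds.
- Hold-out performance: the main Boruta model reached an AUC of 0.9333 on the hold-out test set (the 14 of 75 samples not used for CV training).
- Variable importance: the variable importance plot shows the top features contributing to the main Boruta model’s predictions.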

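These results characterize the dataset and the model’s predictive capabilities, with two caveats. The dataset is small (75 samples, 14 of them held out), so the hold-out AUC of 0.9333 rests on few predictions and sits well above the cross-validated estimates. In addition, Boruta was run on all 75 samples before the hold-out split, so the hold-out samples influenced feature selection and the hold-out estimate is likely optimistic. Boruta feature selection, in conjunction with the other pipeline steps, aimed to identify a robust and informative set of features for classification.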