This tutorial demonstrates how to build a prediction model using machine learning classification techniques.

Introduction

This project focuses on accurately classifying glioblastoma multiforme (GBM), lung squamous cell carcinoma (LUSC), and ovarian cancer (OV) using gene expression data from The Cancer Genome Atlas (TCGA). With 886 samples and 12,043 gene expression features, it leverages machine learning, particularly XGBoost, to improve cancer subtype classification for personalized treatment strategies. By analyzing high-dimensional genomic data, the study aims to discover biomarkers, enhance diagnostic precision, and provide insights into the molecular mechanisms driving these cancers. The results could significantly impact precision oncology by bridging genomic data with clinical decision-making.

Data Pre-processing

The data was cleaned by removing the sample identifier column, moving the target variable to the first column, and converting it to a categorical (factor) format. A simple random 70-30 train-test split with a fixed seed held out 266 of the 886 samples for evaluation; the split was not stratified by class. Features were converted into a numeric matrix suitable for XGBoost, with no additional feature engineering needed because the gene expression data was already log-transformed and standardized.
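
Because the split described above is a simple random one, class shares in the training set can drift from those in the full data. The sketch below contrasts it with a stratified split in base R. It is only an illustration: the toy data frame, the label strings, and the gene_a column are made up, and stratification is an alternative that the original analysis did not use.

```r
# Toy stand-in for the real data: 20 samples with an imbalanced three-class label.
# (Hypothetical data; the real train.csv is not reproduced here.)
set.seed(1)
toy <- data.frame(
  cancer = factor(rep(c("GBM", "LUSC", "OV"), times = c(9, 3, 8))),
  gene_a = rnorm(20)
)

# Simple random split, as in the write-up: class shares can drift by chance.
random_idx <- sample(seq_len(nrow(toy)), size = floor(0.7 * nrow(toy)))

# Stratified alternative: sample 70% of the row indices within each class.
strat_idx <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$cancer),
  function(rows) rows[sample.int(length(rows), size = floor(0.7 * length(rows)))]
))

# Compare class proportions: full data, random training set, stratified training set.
print(round(prop.table(table(toy$cancer)), 2))
print(round(prop.table(table(toy$cancer[random_idx])), 2))
print(round(prop.table(table(toy$cancer[strat_idx])), 2))
```

With stratification, each class contributes 70% of its own rows (rounded down), so the minority class cannot vanish from the training set by chance.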

Methods

XGBoost was chosen for its ability to handle high-dimensional data and capture complex relationships. The model was configured for multi-class classification with a softmax objective, a maximum tree depth of 6, a learning rate of 0.1, and 100 boosting rounds. XGBoost was selected over alternatives like Random Forests and Neural Networks due to its efficiency, interpretability, and proven success in genomic datasets.
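
To make the multi-class setup more concrete, the sketch below shows how a softmax turns one raw score per class into probabilities and a predicted class. The scores are invented for illustration and do not come from the trained model; the 0-based class index mirrors the label convention XGBoost uses.

```r
# Numerically stable softmax: subtracting the maximum avoids overflow in exp().
softmax <- function(scores) {
  shifted <- exp(scores - max(scores))
  shifted / sum(shifted)
}

# Hypothetical raw scores for one sample across three classes.
raw_scores <- c(2.0, 0.5, -1.0)
probs <- softmax(raw_scores)

# The predicted class is the most probable one, reported as a 0-based index.
predicted_class <- which.max(probs) - 1

print(round(probs, 3))
print(predicted_class)
```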

Model Performance

Model performance was assessed using a confusion matrix, accuracy, sensitivity, specificity, and precision. Balanced accuracy (the mean of sensitivity and specificity) was reported for each class to check that performance held across classes of different sizes. Feature importance analysis highlighted key gene expression predictors, aiding model interpretability and biomarker discovery.
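
As a rough illustration of how these metrics follow from a confusion matrix, the base R sketch below recomputes them from the test-set counts reported in the output later in this write-up. The variable names are illustrative; the original analysis obtained these figures from confusionMatrix() rather than computing them by hand.

```r
# Confusion matrix from the held-out test set reported in this write-up
# (rows = predicted class, columns = reference class).
cm <- matrix(
  c(112,  2,   0,
      0, 31,   0,
      0,  1, 120),
  nrow = 3, byrow = TRUE,
  dimnames = list(Prediction = c("1", "2", "3"), Reference = c("1", "2", "3"))
)

n  <- sum(cm)
tp <- diag(cm)
fp <- rowSums(cm) - tp   # predicted as the class, actually another class
fn <- colSums(cm) - tp   # actually the class, predicted as another class
tn <- n - tp - fp - fn

accuracy          <- sum(tp) / n
sensitivity       <- tp / (tp + fn)
specificity       <- tn / (tn + fp)
precision         <- tp / (tp + fp)
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(accuracy, 4))
print(round(rbind(sensitivity, specificity, precision, balanced_accuracy), 4))
```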

Predictive Ability

The XGBoost model achieved 98.87% accuracy (95% CI: 96.74%-99.77%) and a Kappa of 0.9812 on the 266 held-out samples, indicating excellent agreement between predicted and actual classes. Class-specific sensitivity (0.91 to 1.00), specificity (0.99 to 1.00), and balanced accuracy (0.96 to 1.00) were all high, demonstrating strong performance despite class imbalance. The model ranked first on the Kaggle leaderboard with a perfect public dataset score, supporting its potential for clinical applications and for improving personalized cancer treatment.

The model achieved a perfect score of 1.000 on the public dataset on Kaggle (www.kaggle.com/competitions/classification-of-cancer-type).
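
For readers who want to see where the Kappa and the confidence interval for accuracy come from, the sketch below recomputes both from the reported test-set confusion matrix. It assumes the interval is the exact binomial one returned by binom.test(), which is how caret's confusionMatrix() is understood to compute it; variable names are illustrative.

```r
# Test-set confusion matrix reported in this write-up
# (rows = predicted class, columns = reference class).
cm <- matrix(
  c(112,  2,   0,
      0, 31,   0,
      0,  1, 120),
  nrow = 3, byrow = TRUE
)

n <- sum(cm)

# Cohen's kappa: observed agreement corrected for agreement expected by chance.
observed    <- sum(diag(cm)) / n
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (observed - expected) / (1 - expected)

# Exact (Clopper-Pearson) 95% confidence interval for the accuracy.
accuracy_ci <- binom.test(sum(diag(cm)), n)$conf.int

print(round(cohen_kappa, 4))
print(round(accuracy_ci, 4))
```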

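# Load required packages
library(dplyr)    # %>% and select()
library(xgboost)  # xgb.DMatrix(), xgboost(), xgb.importance()
library(caret)    # confusionMatrix()
# Load data
training_data <- read.csv("train.csv")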
# Data Cleaning
training_data <- training_data %>% select(-id)
training_data <- training_data[, c("cancer", sort(setdiff(names(training_data), "cancer")))]
training_data$cancer <- as.factor(training_data$cancer)
# 70-30 train-test split (simple random sample, not stratified by class)
set.seed(1)  # fixed seed for reproducibility
train_indices <- sample(seq_len(nrow(training_data)), size = floor(0.7 * nrow(training_data)))
train_set <- training_data[train_indices, ]
test_set <- training_data[-train_indices, ]
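# Label and feature preparation
train_matrix <- as.matrix(train_set[, -1])
# XGBoost expects 0-based class labels, so shift the factor codes down by one
train_labels <- as.integer(train_set$cancer) - 1
# Test labels stay 1-based; predictions are shifted up by one before evaluation
test_labels <- as.integer(test_set$cancer)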
# Set up for XGBoost
dmat.train <- xgb.DMatrix(data = train_matrix, label = train_labels)
xgb_parameters <- list(
  objective = "multi:softmax",
  num_class = length(unique(train_labels)),
  max_depth = 6,
  eta = 0.1,
  nthread = 2
)
# Train the XGBoost model 
xgb_classifier <- xgboost(
  params = xgb_parameters,
  data = dmat.train,
  nrounds = 100,
  verbose = 1
)
# Make predictions using the trained XGBoost model
train_columns <- colnames(train_set)[-1]
test_set <- test_set[, train_columns]
test_set_matrix <- as.matrix(test_set)
dmat.test <- xgb.DMatrix(data = test_set_matrix)
predictions <- predict(xgb_classifier, newdata = dmat.test)
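# Shift 0-based predictions back to 1-based codes; fix the levels so both
# factors match even if a class is never predicted
class_codes <- seq_len(nlevels(training_data$cancer))
predictions1 <- factor(predictions + 1, levels = class_codes)
# Evaluate performance with a confusion matrix
confusionMatrix(data = predictions1, reference = factor(test_labels, levels = class_codes))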
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3
##          1 112   2   0
##          2   0  31   0
##          3   0   1 120
## 
## Overall Statistics
##                                                
##                Accuracy : 0.9887               
##                  95% CI : (0.9674, 0.9977)     
##     No Information Rate : 0.4511               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.9812               
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            1.0000   0.9118   1.0000
## Specificity            0.9870   1.0000   0.9932
## Pos Pred Value         0.9825   1.0000   0.9917
## Neg Pred Value         1.0000   0.9872   1.0000
## Prevalence             0.4211   0.1278   0.4511
## Detection Rate         0.4211   0.1165   0.4511
## Detection Prevalence   0.4286   0.1165   0.4549
## Balanced Accuracy      0.9935   0.9559   0.9966
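
The classes in the output above appear as 1, 2, and 3 because the code works with the integer codes of the cancer factor. A minimal sketch of mapping 0-based predictions back to readable class names is shown below. The label strings and the predicted codes are hypothetical, since the actual values in train.csv are not shown in this write-up.

```r
# Hypothetical label column; factor levels are sorted alphabetically by default.
cancer <- factor(c("OV", "GBM", "LUSC", "GBM", "OV"))

# 0-based integer labels, the form passed to XGBoost.
codes <- as.integer(cancer) - 1

# Suppose the model returned these 0-based predictions (made up for the example).
predicted_codes <- c(2, 0, 1, 0, 2)

# Map the codes back to class names, keeping the full set of levels.
predicted_labels <- factor(levels(cancer)[predicted_codes + 1],
                           levels = levels(cancer))

print(table(Prediction = predicted_labels, Reference = cancer))
```

Keeping the original levels on both factors means a confusion matrix built from them is labelled by class name rather than by integer code.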

Inference

The figure below shows the top predictors driving the XGBoost model’s classification of the three cancer types, among 58 features. CDX4, ABBA.1, and CALML3 are the most influential gene expression markers. Their high relative importance indicates that the model relies heavily on these genes for accurate classification, while features like PKP1 and RNF11 contribute minimally. This analysis enhances the model’s interpretability and suggests potential biomarkers for cancer diagnosis and treatment.

# Compute feature importance and plot the ten most important features
importance_matrix <- xgb.importance(feature_names = colnames(train_matrix), model = xgb_classifier)
xgb.plot.importance(importance_matrix[1:10, ])

Conclusion

This study’s main limitation is dataset imbalance: LUSC has fewer samples than the other two classes, which reduced the model’s sensitivity for that class despite XGBoost’s robustness. External validation and incorporating multi-omics data could improve generalizability and predictive power. Feature importance highlighted genes like CDX4 and CALML3 as potential biomarkers, which require further validation before clinical use. Addressing these limitations and exploring advanced methods can enhance cancer diagnosis, support personalized treatments, and improve patient outcomes.
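
One common way to counter class imbalance, not used in the original analysis, is to weight each sample inversely to its class frequency. The sketch below computes such weights in base R on made-up labels. In practice they could be supplied to XGBoost through the weight argument of xgb.DMatrix(), which is an assumption about how one might extend the code above rather than part of it.

```r
# Hypothetical imbalanced labels (not the real data).
labels <- factor(rep(c("GBM", "LUSC", "OV"), times = c(9, 3, 8)))

# Inverse-frequency class weights: n / (number of classes * class count).
class_counts  <- table(labels)
class_weights <- length(labels) / (nlevels(labels) * class_counts)

# One weight per sample, looked up by its class name.
sample_weights <- as.numeric(class_weights[as.character(labels)])

print(round(class_weights, 3))
print(sum(sample_weights))  # each class contributes equally to the total
```

Under this scheme, the minority class receives the largest per-sample weight, so its errors count for more during training.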