Rocha LAJ, Silveira LO, Neves NPC
INTRODUCTION
Machine learning (ML) is a collection of strategies to build classification and regression models. Thanks to today's increasing computational capacity and accessibility, ML is used in multiple areas, including the health sciences. The main advantages of ML models over traditional biostatistics models include:
- Adaptability: ML takes into account all the information available in a given dataset, and multiple settings (hyperparameters) can be tuned to improve prediction accuracy [1].
- Interactions: ML can address interactions, which are challenging for biostatistics models [2].
- Diversity: ML can handle different sources of data, including demographic information, images and laboratory findings [2].
- Flexibility: ML models do not rely on typical distributional assumptions [3].
Among ML models there is Random Forest (RF). RF is based on multiple decision trees searching for important variables to reach a given outcome (supervised learning). In R, if the outcome is coded as a factor, the RF model will run in classification mode, while a continuous outcome will be used to build a regression model. The data are separated into multiple subsets constituted of nodes, and the best tree is selected based on the entropy or purity of its nodes.
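As a minimal sketch (assuming a hypothetical data frame df containing an Outcome column and the randomForest package installed), the class of the outcome determines the mode:
library(randomForest)
df$Outcome <- as.factor(df$Outcome)  # factor outcome -> classification forest
rf_class <- randomForest(Outcome ~ ., data = df)
rf_class$type  # returns "classification"; a numeric outcome would return "regression"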
BUILDING AN RF MODEL
Typically, an RF classification in R follows these steps:
- Summary analysis
After the data is imported
data <- read.csv(filename, header = TRUE) # filename is the path to your CSV file
a summary analysis is done, looking for Not Available (NA) cells or discrepant behavior:
summary(data)
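To pinpoint missing values explicitly, a per-column count of NA cells can also be added (a minimal sketch using base R):
colSums(is.na(data)) # number of NA cells per column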
A similar overview can be obtained with the psych package using the describe function:
psych::describe(data)
- Split
After the data are verified and cleaned up as shown in step 1, a split can be done to separate the dataset into training and validation sets, using sample.split from the caTools package as an example:
library(caTools)
split <- sample.split(data$Outcome, SplitRatio = 0.70)
training <- subset(data, split == TRUE)
validation <- subset(data, split == FALSE) #70-30 rule
training$Outcome <- as.factor(training$Outcome)
validation$Outcome <- as.factor(validation$Outcome)
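It can also be useful to check the outcome distribution in each partition after the split, since unbalanced classes will matter during training (a minimal sketch using base R):
table(training$Outcome)               # class counts in the training set
prop.table(table(validation$Outcome)) # class proportions in the validation set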
- Training
After the split, as shown in step 2, there are multiple ways to build an RF model. One can use the randomForest package or the caret package:
library(randomForest)
library(caret)
rf_randomForest <- randomForest(Outcome~., data=training)
cross <- trainControl(method = "cv", number = 5, p = .8, sampling = "rose")
rf_caret <- train(Outcome~., data=training, method="rf", trControl = cross)
ROSE, implemented via the sampling argument in trainControl, is one of many strategies to balance binary outcomes in an attempt to avoid under- or overfitting. number = 5 indicates the number of cross-validation folds. p = 0.8 indicates how much of the data will be used during the training phase. method = "cv" indicates that cross-validation will be done, here using 5 equal parts. In general, rf_caret will contain a tree object similar to the one randomForest generates; however, the randomForest object is more straightforward and easier to use, with more packages supporting its format for downstream analysis. One can use str(rf_randomForest) or str(rf_caret) to inspect their properties.
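Since RF searches for important variables, the trained forest can also be inspected for variable importance; a minimal sketch using functions from the randomForest and caret packages:
importance(rf_randomForest)  # mean decrease in Gini per predictor
varImpPlot(rf_randomForest)  # plot the most important variables
varImp(rf_caret)             # caret equivalent for the trained model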
- Validation
Now that we have trained our model, the validation step is essential to establish how much of our training is reproducible before we test the model with new data. This short presentation will limit itself to the loop between training and validation. In practice, validation can be done with:
predmodel <- predict(rf_caret, validation) #or rf_randomForest
cmatrix <- confusionMatrix(predmodel, validation$Outcome)
The confusionMatrix will return an object with a lot of information about our model's performance, such as accuracy, positive predictive value, negative predictive value, sensitivity and specificity. These data will be important to build our log and to analyze our results retrospectively in the future. Another important metric is the F1 score, obtained with the F_meas function as follows:
f1score <- F_meas(predmodel, validation$Outcome)
Now, to build our log, we need to select which metrics from our cmatrix object are most relevant. As an example, we will select accuracy, precision, recall, the F1 score and the p-value of Accuracy > No Information Rate (NIR):
acc <- cmatrix$overall[["Accuracy"]]
precision <- cmatrix$byClass[["Pos Pred Value"]]
recall <- cmatrix$byClass[["Sensitivity"]]
f1score <- F_meas(predmodel, validation$Outcome)
PNir <- cmatrix$overall[['AccuracyPValue']]
We will need to organize these data so that each run appends one row to our log table:
logx <- paste(acc, precision, recall, f1score, PNir, sep = " | ")
cat(logx, sep="\n")
Finally, we can set the conditions under which we want to export the model, for example:
if (acc >= .60 & precision > .50 & recall > .50 & f1score >= .70 & PNir < 0.05) {
  date_str <- trimws(system("date +%d%m-%H", intern = TRUE)) # day, month and hour, matching the bash script
  modelImage <- paste0("./models/bestRf_", date_str, ".RData")
  save.image(file = modelImage) # save all objects from the current session
}
In short, when a model shows accuracy >= 0.60, precision > 0.50, recall > 0.50, F1 score >= 0.70 and PNir < 0.05 (significant), a file named bestRf_date is generated with an image containing all the objects and variables created to reach such results (including seeds, splits, important variables and the forest itself). This image can be loaded in another session using load(image_name). Notice that the image will be saved in a folder called "models"; create it first.
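For example, the folder can be created from R and a saved image reloaded later as follows (the file name below is illustrative):
dir.create("models", showWarnings = FALSE) # create the models folder if it does not exist
load("./models/bestRf_0101-10.RData")      # restore all objects saved in a previous session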
All of the R code above needs to be saved into a single file named, for example:
RFtraining.R
AUTOMATION
On the Linux side, a simple bash script to start the runs and collect metrics continuously can be written as follows:
#!/bin/bash
currentTime=$(date +%d%m-%H)
echo "Accuracy | Precision | Recall | F1 | pACNir" > ./logs/run_$currentTime
counter=0
bestrf=false
while [ "$bestrf" = false ]; do
    counter=$((counter+1))
    Rscript RFtraining.R >> ./logs/run_$currentTime
    if [[ -e ./models/bestRf_$currentTime.RData ]]; then
        bestrf=true
    fi
done
Add the code above into a file (e.g., file.sh) and make it executable with chmod +x file.sh. Then you can call it with ./file.sh from your terminal.
Notice that our track record until we reach our best result will be exported into a run_currentTime file inside a logs folder; create the folder first. When the best result is achieved (with our given criteria), R will export an image. The script above will detect this image and immediately stop the run. Typically, an RF run for a small dataset won't take more than 30 minutes; if your run is taking more time, you may need to adapt currentTime to keep all your records in one file.
A similar automation can be achieved with the cronR package (https://github.com/bnosac/cronR).
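A minimal sketch, assuming the cronR package is installed (the frequency and id values below are illustrative):
library(cronR)
cmd <- cron_rscript("RFtraining.R")                 # wrap the R script as a cron command
cron_add(cmd, frequency = "hourly", id = "rf_runs") # schedule it to run every hour
Note that cron schedules runs at fixed intervals, rather than looping continuously until the best model appears as the bash script above does.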
Finally, the advantage of our suggestion is the possibility of generating a log with statistics from individual runs, making it possible to track progress and analyze the dataset's behavior across multiple attempts to achieve a desired performance.
Despite these promises, we can observe the following limitations of ML models in the health sciences:
- Data Quality and Availability: ML depends on appropriate data quality; however, acquiring high-quality data in the health sciences is a challenge. In practice, datasets are often incomplete or heterogeneous, and a specific population may not represent all populations even when they share the same phenotype [2].
- Technical Differences: Different forms of data acquisition may introduce noise into an ML dataset [3,4].
DISCLAIMER
Our example is based on the advantages and limitations for the health sciences and on how we can use automation to improve training; however, the code itself can be applied in any scenario.
REFERENCES
[1] Rajula HSR, Verlato G, Manchia M, Antonucci N, Fanos V. Comparison of Conventional Statistical Methods with Machine Learning in Medicine: Diagnosis, Drug Development, and Treatment. Medicina (Kaunas). 2020 Sep 8;56(9):455. doi: 10.3390/medicina56090455. PMID: 32911665; PMCID: PMC7560135.
[2] Habehh H, Gohel S. Machine Learning in Healthcare. Curr Genomics. 2021 Dec 16;22(4):291-300. doi: 10.2174/1389202922666210705124359. PMID: 35273459; PMCID: PMC8822225.
[3] Panch T, Szolovits P, Atun R. Artificial intelligence, machine learning and health systems. J Glob Health. 2018 Dec;8(2):020303. doi: 10.7189/jogh.08.020303. PMID: 30405904; PMCID: PMC6199467.
[4] Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, King D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 2019 Oct 29;17(1):195. doi: 10.1186/s12916-019-1426-2. PMID: 31665002; PMCID: PMC6821018.