Rocha LAJ, Silveira LO, Neves NPC
INTRODUCTION
Machine learning (ML) is a collection of strategies to build classification and regression models. Thanks to today's increasing computational capacity and accessibility, ML is used in multiple areas, including the health sciences. The main advantages of ML models over traditional biostatistics models include:
- Adaptability: ML takes into account all the information available in a given dataset, and multiple settings (hyperparameters) can be tuned to improve prediction accuracy [1].
- Interactions: ML can address interactions, which are challenging for biostatistics models [2].
- Diversity: ML can handle different sources of data, including demographic information, images and laboratory findings [2].
- Flexibility: ML models do not rely on typical distributional assumptions [3].
Among ML models there is Random Forest (RF). RF is based on multiple decision trees searching for important variables to reach a given outcome (supervised learning). In R, if the outcome is coded as a factor, the RF model will run in classification mode, while a continuous outcome will be used to build a regression model. The data are separated into multiple subsets constituted of nodes, and the best tree is selected based on the entropy or purity of its nodes.
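As a minimal sketch (assuming a hypothetical data frame df containing an Outcome column and the randomForest package installed), the class of the outcome determines the mode:
library(randomForest)
df$Outcome <- as.factor(df$Outcome)  # factor outcome -> classification forest
rf_class <- randomForest(Outcome ~ ., data = df)
rf_class$type  # returns "classification"; a numeric outcome would return "regression"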
BUILDING AN RF MODEL
Typically, an RF classification in R follows these steps:
- Summary analysis
After the data is imported
data <- read.csv(filename, header = TRUE) # filename is the path to your CSV file
a summary analysis is done, looking for Not Available (NA) cells or discrepant behavior:
summary(data)
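To pinpoint missing values explicitly, a per-column count of NA cells can also be added (a minimal sketch using base R):
colSums(is.na(data)) # number of NA cells per column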
A similar overview can be obtained with the psych package using the describe function:
psych::describe(data)
- Split
After the data are verified and cleaned up as shown in step 1, a split can be done to separate the dataset into training and validation sets, using sample.split from the caTools package as an example:
library(caTools)
split <- sample.split(data$Outcome, SplitRatio = 0.70)
training <- subset(data, split == TRUE)
validation <- subset(data, split == FALSE) #70-30 rule
training$Outcome <- as.factor(training$Outcome)
validation$Outcome <- as.factor(validation$Outcome)
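It can also be useful to check the outcome distribution in each partition after the split, since unbalanced classes will matter during training (a minimal sketch using base R):
table(training$Outcome)               # class counts in the training set
prop.table(table(validation$Outcome)) # class proportions in the validation set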
- Training
After the split, as shown in step 2, there are multiple ways to build an RF model. One can use the randomForest package or the caret package:
library(randomForest)
library(caret)
rf_randomForest <- randomForest(Outcome~., data=training)
cross <- trainControl(method = "cv", number = 5, p = .8, sampling = "rose")
rf_caret <- train(Outcome~., data=training, method="rf", trControl = cross)
ROSE, implemented via the sampling argument in trainControl, is one of many strategies to balance binary outcomes in an attempt to avoid under- or overfitting. number = 5 indicates the number of cross-validation folds. p = 0.8 indicates how much of the data will be used during the training phase. method = "cv" indicates that cross-validation will be done, here using 5 equal parts. In general, rf_caret will contain a tree object similar to the one randomForest generates; however, the randomForest object is more straightforward and easier to use, with more packages supporting its format for downstream analysis. One can use str(rf_randomForest) or str(rf_caret) to inspect their properties.
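Since RF searches for important variables, the trained forest can also be inspected for variable importance; a minimal sketch using functions from the randomForest and caret packages:
importance(rf_randomForest)  # mean decrease in Gini per predictor
varImpPlot(rf_randomForest)  # plot the most important variables
varImp(rf_caret)             # caret equivalent for the trained model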
- Validation
Now that we have trained our model, the validation step is essential to establish how much of our training is reproducible before we test the model with new data. This short presentation will limit itself to the loop between training and validation. In practice, validation can be done with:
predmodel <- predict(rf_caret, validation) #or rf_randomForest
cmatrix <- confusionMatrix(predmodel, validation$Outcome)
The confusionMatrix will return an object with a lot of information about our model's performance, such as accuracy, positive predictive value, negative predictive value, sensitivity and specificity. These data will be important to build our log and to analyze our results retrospectively in the future. Another important metric is the F1 score, obtained with the F_meas function as follows:
f1score <- F_meas(predmodel, validation$Outcome)
Now, to build our log, we need to select which metrics from our cmatrix object are most relevant. As an example, we will select accuracy, precision, recall, the F1 score and the p-value of Accuracy > No Information Rate (NIR):
acc <- cmatrix$overall[["Accuracy"]]
precision <- cmatrix$byClass[["Pos Pred Value"]]
recall <- cmatrix$byClass[["Sensitivity"]]
f1score <- F_meas(predmodel, validation$Outcome)
PNir <- cmatrix$overall[['AccuracyPValue']]
We will need to organize these data so that each run appends one row to our log table:
logx <- paste(acc, precision, recall, f1score, PNir, sep = " | ")
cat(logx, sep="\n")
Finally, we can set the conditions under which we want to export the model, for example:
if (acc >= .60 & precision > .50 & recall > .50 & f1score >= .70 & PNir < 0.05) {
  date_str <- trimws(system("date +%d%m-%H", intern = TRUE)) # day, month and hour, matching the bash script
  modelImage <- paste0("./models/bestRf_", date_str, ".RData")
  save.image(file = modelImage) # save all objects from the current session
}
In short, when a model shows accuracy >= 0.60, precision > 0.50, recall > 0.50, F1 score >= 0.70 and PNir < 0.05 (significant), a file named bestRf_date is generated with an image containing all the objects and variables created to reach such results (including seeds, splits, important variables and the forest itself). This image can be loaded in another session using load(image_name). Notice that the image will be saved in a folder called "models"; create it first.
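For example, the folder can be created from R and a saved image reloaded later as follows (the file name below is illustrative):
dir.create("models", showWarnings = FALSE) # create the models folder if it does not exist
load("./models/bestRf_0101-10.RData")      # restore all objects saved in a previous session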
All of the R code above needs to be saved into a single file named, for example:
RFtraining.R
AUTOMATION
On the Linux side, a simple bash script to start the runs and collect metrics continuously can be written as follows:
#!/bin/bash
currentTime=$(date +%d%m-%H)
echo "Accuracy | Precision | Recall | F1 | pACNir" > ./logs/run_$currentTime
counter=0
bestrf=false
while [ "$bestrf" = false ]; do
    counter=$((counter+1))
    Rscript RFtraining.R >> ./logs/run_$currentTime
    if [[ -e ./models/bestRf_$currentTime.RData ]]; then
        bestrf=true
    fi
done
Add the code above into a file (e.g., file.sh) and make it executable with chmod +x file.sh. Then you can call it with ./file.sh from your terminal.
Notice that our track record until we reach our best result will be exported into a run_currentTime file inside a logs folder; create the folder first. When the best result is achieved (with our given criteria), R will export an image. The script above will detect this image and immediately stop the run. Typically, an RF run for a small dataset won't take more than 30 minutes; if your run is taking more time, you may need to adapt currentTime to keep all your records in one file.
A similar automation can be achieved with the cronR package (https://github.com/bnosac/cronR).
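A minimal sketch, assuming the cronR package is installed (the frequency and id values below are illustrative):
library(cronR)
cmd <- cron_rscript("RFtraining.R")                 # wrap the R script as a cron command
cron_add(cmd, frequency = "hourly", id = "rf_runs") # schedule it to run every hour
Note that cron schedules runs at fixed intervals, rather than looping continuously until the best model appears as the bash script above does.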
Finally, the advantage of our suggestion is the possibility of generating a log with statistics from individual runs, making it possible to track progress and analyze the dataset's behavior across multiple attempts to achieve a desired performance.
Despite these promises, we can observe the following limitations of ML models in the health sciences:
- Data Quality and Availability: ML depends on appropriate data quality; however, acquiring high-quality data in the health sciences is a challenge. In practice, datasets are often incomplete or heterogeneous, and a specific population may not represent all populations even when they share the same phenotype [2].
- Technical Differences: Different forms of data acquisition may introduce noise into an ML dataset [3,4].
DISCLAIMER
Our example is based on the advantages and limitations for the health sciences and on how we can use automation to improve training; however, the code itself can be applied in any scenario.
REFERENCES
[1] Rajula HSR, Verlato G, Manchia M, Antonucci N, Fanos V. Comparison of Conventional Statistical Methods with Machine Learning in Medicine: Diagnosis, Drug Development, and Treatment. Medicina (Kaunas). 2020 Sep 8;56(9):455. doi: 10.3390/medicina56090455. PMID: 32911665; PMCID: PMC7560135.
[2] Habehh H, Gohel S. Machine Learning in Healthcare. Curr Genomics. 2021 Dec 16;22(4):291-300. doi: 10.2174/1389202922666210705124359. PMID: 35273459; PMCID: PMC8822225.
[3] Panch T, Szolovits P, Atun R. Artificial intelligence, machine learning and health systems. J Glob Health. 2018 Dec;8(2):020303. doi: 10.7189/jogh.08.020303. PMID: 30405904; PMCID: PMC6199467.
[4] Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, King D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 2019 Oct 29;17(1):195. doi: 10.1186/s12916-019-1426-2. PMID: 31665002; PMCID: PMC6821018.