Machine learning techniques can be used to analyze medical data and empower clinicians in their decision-making processes. Even though these models have proven very effective in other fields, it is important to keep in mind that applying machine learning, artificial intelligence, and computational statistics to the clinical sciences should always come with close supervision by professionals who understand both approaches.

This document shows some applications that could significantly impact the decision-making process in the clinical context.

Preprocessing Data

A dataset on diabetes among the Pima Indians is used for this example:

library(healthcareai)
## healthcareai version 2.0.0
## Please visit https://docs.healthcare.ai for full documentation and vignettes. Join the community at https://healthcare-ai.slack.com
str(pima_diabetes)
## Classes 'tbl_df', 'tbl' and 'data.frame':    768 obs. of  10 variables:
##  $ patient_id    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ pregnancies   : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ plasma_glucose: int  148 85 183 89 137 116 78 115 197 125 ...
##  $ diastolic_bp  : int  72 66 64 66 40 74 50 NA 70 96 ...
##  $ skinfold      : int  35 29 NA 23 35 NA 32 NA 45 NA ...
##  $ insulin       : int  NA NA NA 94 168 NA 88 NA 543 NA ...
##  $ weight_class  : chr  "obese" "overweight" "normal" "overweight" ...
##  $ pedigree      : num  0.627 0.351 0.672 0.167 2.288 ...
##  $ age           : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ diabetes      : chr  "Y" "N" "Y" "N" ...
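
The dataset contains missing values (for example in skinfold and insulin). Before modeling, it can help to quantify them; a quick check with the package's missingness() helper:

# Percent of values missing in each column
missingness(pima_diabetes)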

Since the objective here is to find the algorithm that best predicts who will develop diabetes and who won't, a subset of classification algorithms is trained and evaluated to find the best option.

quick_models <- machine_learn(pima_diabetes, patient_id, outcome = diabetes)
## Training new data prep recipe
## Variable(s) ignored in prep_data won't be used to tune models: patient_id
## diabetes looks categorical, so training classification algorithms.
## Training with cross validation: Random Forest
## Training with cross validation: k-Nearest Neighbors
## 
## *** Models successfully trained. The model object contains the training data minus ignored ID columns. ***
## *** If there was PHI in training data, normal PHI protocols apply to the model object. ***

Once the training process is finished, the highest-performing algorithm can be selected:

quick_models
## Algorithms Trained: Random Forest, k-Nearest Neighbors
## Target: diabetes
## Class: Classification
## Performance Metric: AUROC
## Number of Observations: 768
## Number of Features: 12
## Models Trained: 2018-06-26 12:11:55 
## 
## Models tuned via 5-fold cross validation over 10 combinations of hyperparameter values.
## Best model: Random Forest
## AUPR = 0.72, AUROC = 0.84
## Optimal hyperparameter values:
##   mtry = 2
##   splitrule = extratrees
##   min.node.size = 9

It is important to highlight the area under the ROC curve (AUROC) of 0.84, which can be interpreted as good performance. This means that we can proceed to classification with this model.

predictions <- predict(quick_models)
predictions
## # A tibble: 768 x 14
##    diabetes predicted_diabetes pregnancies plasma_glucose diastolic_bp
##  * <fct>                 <dbl>       <int>          <dbl>        <dbl>
##  1 Y                    0.579            6            148         72  
##  2 N                    0.177            1             85         66  
##  3 Y                    0.345            8            183         64  
##  4 N                    0.0814           1             89         66  
##  5 Y                    0.499            0            137         40  
##  6 N                    0.281            5            116         74  
##  7 Y                    0.208            3             78         50  
##  8 N                    0.410           10            115         72.4
##  9 Y                    0.690            2            197         70  
## 10 Y                    0.352            8            125         96  
## # ... with 758 more rows, and 9 more variables: skinfold <dbl>,
## #   insulin <dbl>, pedigree <dbl>, age <int>, weight_class_normal <dbl>,
## #   weight_class_obese <dbl>, weight_class_overweight <dbl>,
## #   weight_class_other <dbl>, weight_class_missing <dbl>
plot(predictions)
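
The same predict() method can score patients who were not part of the training data; a minimal sketch, reusing a few rows of pima_diabetes as stand-ins for hypothetical new patients:

# Hypothetical new patients: rows with the same columns as the training data
new_patients <- pima_diabetes[1:5, ]
predict(quick_models, new_patients)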

Data preparation

split_data <- split_train_test(d = pima_diabetes,
                               outcome = diabetes,
                               p = .9,
                               seed = 84105)

prepped_training_data <- prep_data(split_data$train, patient_id, outcome = diabetes,
                                   center = TRUE, scale = TRUE,
                                   collapse_rare_factors = FALSE)
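
At prediction time, predict() applies this prep recipe to raw data automatically. To prepare the held-out test split explicitly, the recipe stored in the prepped training data can be reused; a minimal sketch, assuming prep_data's recipe argument accepts a previously prepped data frame:

# Reuse the transformations learned on the training data (centering, scaling, etc.)
prepped_test_data <- prep_data(split_data$test, recipe = prepped_training_data)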

Model Training

models <- tune_models(d = prepped_training_data,
                      outcome = diabetes,
                      models = "RF",
                      tune_depth = 25,
                      metric = "PR")
## Variable(s) ignored in prep_data won't be used to tune models: patient_id
## diabetes looks categorical, so training classification algorithms.
## You've chosen to tune 125 models (n_folds = 5 x tune_depth = 25 x length(models) = 1) on a 692 row dataset. This may take a while...
## Training with cross validation: Random Forest
## 
## *** Models successfully trained. The model object contains the training data minus ignored ID columns. ***
## *** If there was PHI in training data, normal PHI protocols apply to the model object. ***
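
Once tuned, the models can be evaluated out of sample; a minimal sketch in which predict() applies the stored prep recipe to the raw test split before scoring:

# Out-of-sample performance on the untouched test split
test_predictions <- predict(models, split_data$test)
plot(test_predictions)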

Clinical importance of each variable

get_variable_importance(models) %>%
  plot()

This plot is very useful in clinical practice because it shows each variable and how much it impacts the final result (having diabetes). The same analysis can be applied to other pathologies.
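
The importances can also be retrieved as a table rather than a plot, which may be easier to include in a report; a minimal sketch:

# Variables ranked by their contribution to the model's predictions
get_variable_importance(models)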