The following steps will show you how to prepare the data.
The dataset for the example will be the Thyroid dataset contained in
the MLDataR package.
td <- MLDataR::thyroid_disease
skim(td)

| Data summary | |
|---|---|
| Name | td |
| Number of rows | 3772 |
| Number of columns | 28 |
| Column type frequency: | |
| character | 2 |
| numeric | 26 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ThryroidClass | 0 | 1 | 4 | 8 | 0 | 2 | 0 |
| ref_src | 0 | 1 | 3 | 5 | 0 | 5 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| patient_age | 1 | 1.00 | 51.63 | 18.98 | 1.00 | 36.00 | 54.00 | 67.00 | 94.00 | ▁▆▆▇▂ |
| patient_gender | 0 | 1.00 | 0.66 | 0.47 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▅▁▁▁▇ |
| presc_thyroxine | 0 | 1.00 | 0.12 | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| queried_why_on_thyroxine | 0 | 1.00 | 0.01 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| presc_anthyroid_meds | 0 | 1.00 | 0.01 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| sick | 0 | 1.00 | 0.04 | 0.19 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| pregnant | 0 | 1.00 | 0.01 | 0.12 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| thyroid_surgery | 0 | 1.00 | 0.01 | 0.12 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| radioactive_iodine_therapyI131 | 0 | 1.00 | 0.02 | 0.12 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| query_hypothyroid | 0 | 1.00 | 0.06 | 0.24 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| query_hyperthyroid | 0 | 1.00 | 0.06 | 0.24 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| lithium | 0 | 1.00 | 0.00 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| goitre | 0 | 1.00 | 0.01 | 0.09 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| tumor | 0 | 1.00 | 0.03 | 0.16 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hypopituitarism | 0 | 1.00 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| psych_condition | 0 | 1.00 | 0.05 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| TSH_measured | 0 | 1.00 | 0.90 | 0.30 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▁▁▇ |
| TSH_reading | 369 | 0.90 | 5.09 | 24.52 | 0.00 | 0.50 | 1.40 | 2.70 | 530.00 | ▇▁▁▁▁ |
| T3_measured | 0 | 1.00 | 0.80 | 0.40 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▂▁▁▁▇ |
| T3_reading | 769 | 0.80 | 2.01 | 0.83 | 0.05 | 1.60 | 2.00 | 2.40 | 10.60 | ▇▅▁▁▁ |
| T4_measured | 0 | 1.00 | 0.94 | 0.24 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▁▁▇ |
| T4_reading | 231 | 0.94 | 108.32 | 35.60 | 2.00 | 88.00 | 103.00 | 124.00 | 430.00 | ▃▇▁▁▁ |
| thyrox_util_rate_T4U_measured | 0 | 1.00 | 0.90 | 0.30 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▁▁▇ |
| thyrox_util_rate_T4U_reading | 387 | 0.90 | 0.99 | 0.20 | 0.25 | 0.88 | 0.98 | 1.08 | 2.32 | ▁▇▂▁▁ |
| FTI_measured | 0 | 1.00 | 0.90 | 0.30 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▁▁▇ |
| FTI_reading | 385 | 0.90 | 110.47 | 33.09 | 2.00 | 93.00 | 107.00 | 124.00 | 395.00 | ▁▇▁▁▁ |
We will remove the missing values for now, but you could impute these with methods such as MICE or mean/mode/median imputation.
td_clean <- td[complete.cases(td),]
dim(td_clean)

## [1] 2751 28
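If you preferred imputation over dropping rows, a minimal sketch using the recipes package might look like the following (step_impute_median and the all_numeric_predictors selector are assumptions about the recipes version in use):

# A minimal sketch: impute missing numeric readings instead of dropping them
impute_rcp <- recipes::recipe(ThryroidClass ~ ., data = td) %>%
  recipes::step_impute_median(recipes::all_numeric_predictors())
td_imputed <- impute_rcp %>% recipes::prep() %>% recipes::juice()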
Next we will view the class distribution of the classification task:
table_class <- table(td_clean$ThryroidClass)
class_imbalance_original <- unclass(prop.table(table_class))[1:2]
print(class_imbalance_original)

##
## negative sick
## 0.92075609 0.07924391
We will do some oversampling of the sick cases later on in this tutorial; left untreated, this level of imbalance would lead to skewed ML models that predict most patients not to have a thyroid issue.
SMOTE (Synthetic Minority Oversampling Technique) is the algorithm we will use for dealing with the imbalance. This method is used to obtain a synthetically class-balanced, or nearly class-balanced, training set, which is then used to train the classifier.
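To illustrate what SMOTE does before we use it inside a recipe, here is a minimal sketch using themis's standalone smote() helper (the helper and its arguments are an assumption; we apply the recipe-based step_smote() later):

# Illustrative only - all predictors must be numeric and the class must be a factor
balanced <- td_clean %>%
  dplyr::select(-ref_src) %>%
  dplyr::mutate(ThryroidClass = as.factor(ThryroidClass)) %>%
  themis::smote(var = "ThryroidClass", k = 5, over_ratio = 1)
prop.table(table(balanced$ThryroidClass))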
The EDA component we will build sources an external function from the functions sub-folder in our project structure. This file defines the histoplotter function, which enables the visualisation of our continuous variables.
# Get continuous variables only
subset <- td_clean %>%
dplyr::select(ThryroidClass, patient_age, TSH_reading, T3_reading,
T4_reading, thyrox_util_rate_T4U_reading,
FTI_reading)
# Bring in external file for visualisations
source('functions/visualisations.R')
# Use plot function
plot <- histoplotter(subset, ThryroidClass,
chart_x_axis_lbl = 'Thyroid Class',
chart_y_axis_lbl = 'Measures',boxplot_color = 'navy',
boxplot_fill = '#89CFF0', box_fill_transparency = 0.2)
# Add extras to plot
plot + ggthemes::theme_solarized() + theme(legend.position = 'none') +
scale_color_manual(values=c('negative' = 'red', 'sick' = 'blue'))

As you can see, we have a number of outliers in the continuous variables. To deal with this we will apply a standardisation method, such as mean centring and scaling, to bring that variability onto a similar scale and reduce the effect of the statistical outliers. Other treatment options could be to remove these observations via anomaly / outlier detection techniques.
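As a sketch of the standardisation idea (these steps are not part of the recipe we build later, but could be added to it), centring and scaling can be expressed with recipes:

# Minimal sketch: centre and scale the continuous measures in one step
standardise_rcp <- recipes::recipe(ThryroidClass ~ ., data = td_clean) %>%
  recipes::step_normalize(recipes::all_numeric_predictors())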
The next set of steps will be used to get the data ready for training the models - we will have a baseline model and compare it against a model known for tearing up tabular data challenges on Kaggle.
Now we will divide the data into training and testing samples (the cross-validation folds created later will act as our validation data):
td_clean <- td_clean %>%
dplyr::mutate(ThryroidClass = as.factor(ThryroidClass)) %>%
dplyr::select(-ref_src) %>%
drop_na()
# Split the dataset
td_split <-initial_split(td_clean,
strata = ThryroidClass,
prop=0.9,
breaks = 4)
train <- training(td_split)
test <- testing(td_split)

Okay, we have the training and testing samples. The testing sample will be used to assess how accurate the model is on held-out data, and it will feed into the evaluation metrics for the model. We will delve into that later on in this tutorial.
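As a quick, illustrative sanity check, the strata argument should keep the class mix similar in both samples:

prop.table(table(train$ThryroidClass))
prop.table(table(test$ThryroidClass))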
Recipes is a way to simplify the feature engineering process. Back in
the old days you had to do each of these steps to the training data
prior to fitting a model, especially when using packages such as
caret. Now, you can speed this process up massively with
the help of the recipes package. Let’s whip up the recipe:
train_rcp <- recipes::recipe(ThryroidClass ~ ., data=train) %>%
themis::step_smote(ThryroidClass, over_ratio = 0.97, neighbors = 3) %>%
step_zv(all_predictors())
# Prep and juice the recipe so we can view this as a separate data frame
training_df <- train_rcp %>%
prep() %>%
juice()

## Warning: `terms_select()` was deprecated in recipes 0.1.17.
## Please use `recipes_eval_select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
# Class imbalance resolved
class_imbalance_after_smote <- unclass(prop.table(table(training_df$ThryroidClass)))[1:2]
print(class_imbalance_after_smote)

##
## negative sick
## 0.5076684 0.4923316
As we applied Synthetic Minority Oversampling (SMOTE), which is a nearest-neighbours method of oversampling, we need to check what has happened to the binary labels (negative or sick):
imbalance_frame <- tibble(class_imbalance_original,
class_imbalance_after_smote)
print(imbalance_frame)## # A tibble: 2 × 2
## class_imbalance_original class_imbalance_after_smote
## <dbl> <dbl>
## 1 0.921 0.508
## 2 0.0792 0.492
This technique is not always successful: depending on the severity of the imbalance, the synthetic representation of the sick class might still leave the overall distribution imbalanced.
In this example I will create a baseline model and compare against one further classifier, for the sake of brevity. However, in ML challenges it is common to try many different classifiers and pit them against each other in the evaluation stages.
The theory is that if a simple linear classifier does a better job than a more complex algorithm, then stick with good old logistic regression. I won't cover the mathematics of logistic regression in depth, but it follows the linear regression equation very closely, with the addition of a logit (log-odds) link function that turns it from a regressor into a classifier.
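To make that concrete, the model estimates the log-odds of the positive (sick) class as a linear combination of the predictors:

log(p / (1 - p)) = b0 + b1*x1 + ... + bk*xk

so the predicted probability is p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk))), which is what gets thresholded to assign a class label.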
Here I use parsnip to specify the logistic regression model and its computational engine:
lr_mod <- parsnip::logistic_reg() %>%
set_engine('glm')
print(lr_mod)

## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
We will use workflows to create the model workflow:
lr_wf <-
workflow() %>%
add_model(lr_mod) %>%
add_recipe(train_rcp)

A workflow simply bundles the model specification and the preprocessing recipe into a single object that can be fitted in one step.
Next, I will kick off the training process:
lr_fit <-
lr_wf %>%
fit(data=train)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
I want to pull the fitted model coefficients into a tibble I can explore. This can be done below:
lr_fitted <- lr_fit %>%
extract_fit_parsnip() %>%
tidy()

I will visualise this via a bar chart to observe my significant features:
lr_fitted_add <- lr_fitted %>%
mutate(Significance = ifelse(p.value < 0.05,
"Significant", "Insignificant")) %>%
arrange(desc(p.value))
#Create a ggplot object to visualise significance
plot <- lr_fitted_add %>%
ggplot(mapping = aes(x=term, y=p.value, fill=Significance)) +
geom_col() + theme(axis.text.x = element_text(
face="bold", color="#0070BA",
size=8, angle=90)
) + labs(y="P value", x="Terms",
title="P value significance chart",
subtitle="A chart to represent the significant variables in the model",
caption="Produced by Gary Hutson")
plotly::ggplotly(plot)

There are many ways to improve model performance, but the three main ways are bagging, boosting and model stacking.
There are specific R packages for two of these - for bagging see baguette and for stacking see stacks. Otherwise, these can be implemented in caret by extracting the fit objects from the workflow.
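As an illustration only (neither package is used further in this tutorial, and the times engine argument is an assumption), a bagged tree could be specified with baguette in the same workflow style:

library(baguette)
# Minimal bagging sketch - times is the assumed number of bootstrap resamples
bagged_mod <- bag_tree() %>%
  set_engine('rpart', times = 25) %>%
  set_mode('classification')
bagged_wf <- workflow() %>%
  add_model(bagged_mod) %>%
  add_recipe(train_rcp)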
Firstly, we are going to repeat the same process as above and then we are going to compare the results that we get from both models to make a decision about which one to push into production.
This time I will hyperparameter-tune the number of trees to grow and the maximum depth of each tree.
For details of the maths underpinning this model, check out Josh Starmer's excellent videos: https://www.youtube.com/watch?v=ZVFeW798-2I.
xgboost_mod <- boost_tree(trees=tune(), tree_depth = tune()) %>%
set_mode('classification') %>%
set_engine('xgboost')

Here, as stated, I will search over a grid of candidate values to find the best parameters to pass to my model:
# Set the selected parameters in the grid
boost_grid <- dials::grid_regular(
trees(), tree_depth(), levels=5 # 5 levels per parameter (5 x 5 = 25 combinations)
)
# Create the resampling method i.e. K Fold Cross Validation
folds <- vfold_cv(train, v=5)

I will now implement the workflow to manage the XGBoost model:
xgboost_wf <- workflow() %>%
add_model(xgboost_mod) %>%
add_recipe(train_rcp)

Once I have this, I can then iterate through the combinations of folds and hyperparameters, as sketched below:
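The tuning call itself is sketched here: tune_grid() takes the workflow, the resamples and the grid, and produces the xgboost_fold object used below (the default classification metrics, accuracy and ROC AUC, are assumed):

# Grid search across the folds and the hyperparameter grid
xgboost_fold <- tune::tune_grid(
  xgboost_wf,
  resamples = folds,
  grid = boost_grid
)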
We will now select the best model:
best_model <- xgboost_fold %>%
#select_best('accuracy')
select_best('roc_auc')

Visualising the results:
xgboost_fold %>%
collect_metrics() %>%
mutate(tree_depth = factor(tree_depth)) %>%
ggplot(aes(trees, mean, color = tree_depth)) +
geom_line(size=1.5, alpha=0.6) +
geom_point(size=2) +
facet_wrap(~ .metric, scales='free', nrow=2) +
scale_x_log10(labels = scales::label_number()) +
scale_color_viridis_d(option='plasma', begin=.9, end =0) + theme_minimal()
### Finalise the workflow and fit best model
I will now finalise my workflow by selecting the best hyperparameters for the job:
final_wf <-
xgboost_wf %>%
finalize_workflow(best_model)
print(final_wf)

## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: boost_tree()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
##
## • step_smote()
## • step_zv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Boosted Tree Model Specification (classification)
##
## Main Arguments:
## trees = 500
## tree_depth = 11
##
## Computational engine: xgboost
# Final fit of our fold and hyperparameter combination
final_xgboost_fit <-
final_wf %>%
last_fit(td_split)

The final step would be to collect the metrics for evaluation. We will dedicate a separate section to the evaluation of our models:
final_xgboost_fit %>%
collect_metrics()

## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.989 Preprocessor1_Model1
## 2 roc_auc binary 0.995 Preprocessor1_Model1
Next we will look at the workflow fit:
# This extracts the workflow fit
workflow_xgboost_fit <- final_xgboost_fit %>%
extract_workflow()
# This extracts the parsnip model
xgboost_model_fit <- final_xgboost_fit %>%
extract_fit_parsnip()

As we are following on from the XGBoost model building, we will evaluate this model first and then compare it to our baseline model.
The aim here is to check that our predictions match up with our ground truth labels. By default, the class labels are determined by probabilities higher than 0.5; however, we are going to tweak this threshold to only assign the sick label if the probability is greater than 0.7:
# Pass our test data through model
testing_fit_class <- predict(workflow_xgboost_fit, test)
testing_fit_probs <- predict(workflow_xgboost_fit, test, type='prob')
# Bind this on to our test data with the label to compare ground truth vs predicted
predictions<- cbind(test,testing_fit_probs, testing_fit_class) %>%
dplyr::mutate(xgboost_model_pred=.pred_class,
xgboost_model_prob=.pred_sick) %>%
dplyr::select(everything(), -c(.pred_class, .pred_negative)) %>%
dplyr::mutate(xgboost_class_custom = ifelse(xgboost_model_prob >0.7,"sick","negative")) %>%
dplyr::select(-.pred_sick)

We are now going to append the predictions from the baseline model we created earlier to the predictions data frame:
testing_lr_fit_probs <- predict(lr_fit, test, type='prob')
testing_lr_fit_class <- predict(lr_fit, test)
predictions<- cbind(predictions, testing_lr_fit_probs, testing_lr_fit_class)
predictions <- predictions %>%
dplyr::mutate(log_reg_model_pred=.pred_class,
log_reg_model_prob=.pred_sick) %>%
dplyr::select(everything(), -c(.pred_class, .pred_negative)) %>%
dplyr::mutate(log_reg_class_custom = ifelse(log_reg_model_prob >0.7,"sick","negative")) %>%
dplyr::select(-.pred_sick)
# Get a head view of the finalised data
head(predictions)

## ThryroidClass patient_age patient_gender presc_thyroxine
## 3 sick 80 1 0
## 7 negative 71 1 0
## 28 negative 48 0 0
## 36 sick 64 1 0
## 41 negative 65 0 0
## 58 negative 72 0 0
## queried_why_on_thyroxine presc_anthyroid_meds sick pregnant thyroid_surgery
## 3 0 0 0 0 0
## 7 0 0 1 0 0
## 28 1 0 0 0 0
## 36 0 0 0 0 0
## 41 0 0 0 0 0
## 58 0 0 0 0 0
## radioactive_iodine_therapyI131 query_hypothyroid query_hyperthyroid lithium
## 3 0 0 0 0
## 7 0 0 1 0
## 28 0 0 1 0
## 36 0 0 0 0
## 41 0 0 0 0
## 58 0 0 0 0
## goitre tumor hypopituitarism psych_condition TSH_measured TSH_reading
## 3 0 0 0 0 1 2.200
## 7 0 0 0 0 1 0.030
## 28 0 0 0 0 1 5.400
## 36 0 0 0 0 1 0.035
## 41 0 0 0 0 1 14.800
## 58 0 0 0 0 1 4.100
## T3_measured T3_reading T4_measured T4_reading thyrox_util_rate_T4U_measured
## 3 1 0.6 1 80 1
## 7 1 3.8 1 171 1
## 28 1 1.9 1 87 1
## 36 1 1.0 1 103 1
## 41 1 1.5 1 61 1
## 58 1 1.6 1 94 1
## thyrox_util_rate_T4U_reading FTI_measured FTI_reading xgboost_model_pred
## 3 0.70 1 115 sick
## 7 1.13 1 151 negative
## 28 1.00 1 87 negative
## 36 0.85 1 122 sick
## 41 0.85 1 72 negative
## 58 0.92 1 102 negative
## xgboost_model_prob xgboost_class_custom log_reg_model_pred
## 3 9.998764e-01 sick sick
## 7 2.495646e-04 negative negative
## 28 5.078316e-05 negative negative
## 36 9.974521e-01 sick sick
## 41 6.294250e-05 negative negative
## 58 3.025532e-04 negative negative
## log_reg_model_prob log_reg_class_custom
## 3 9.877476e-01 sick
## 7 4.866757e-05 negative
## 28 7.474834e-03 negative
## 36 9.220536e-01 sick
## 41 1.072977e-01 negative
## 58 2.044620e-01 negative
The default caret confusionMatrix function stores everything as printed text and doesn't allow you to work with the individual values from the output. This is the problem the ConfusionTableR package solves: it flattens the confusion matrix into a record-level (single-row) output, so you can easily store the individual metrics as variables, as and when needed.
First, I will evaluate my baseline model using the package:
cm_lr <- ConfusionTableR::binary_class_cm(
#Here you will have to cast to factor type as the tool expects factors
train_labels = as.factor(predictions$log_reg_class_custom),
truth_labels = as.factor(predictions$ThryroidClass),
positive='sick', mode='everything'
)

## [INFO] Building a record level confusion matrix to store in dataset
## [INFO] Build finished and to expose record level cm use the record_level_cm list item
# View the confusion matrix native
cm_lr$confusion_matrix

## Confusion Matrix and Statistics
##
## Reference
## Prediction negative sick
## negative 244 2
## sick 5 25
##
## Accuracy : 0.9746
## 95% CI : (0.9484, 0.9897)
## No Information Rate : 0.9022
## P-Value [Acc > NIR] : 2.344e-06
##
## Kappa : 0.8631
##
## Mcnemar's Test P-Value : 0.4497
##
## Sensitivity : 0.92593
## Specificity : 0.97992
## Pos Pred Value : 0.83333
## Neg Pred Value : 0.99187
## Precision : 0.83333
## Recall : 0.92593
## F1 : 0.87719
## Prevalence : 0.09783
## Detection Rate : 0.09058
## Detection Prevalence : 0.10870
## Balanced Accuracy : 0.95292
##
## 'Positive' Class : sick
##
The baseline model performs pretty well; you can see this is partly the result of fixing our class imbalance. Let's work with the output in a row-wise fashion, as we can extract some metrics we may be interested in:
# Get record level confusion matrix for logistic regression model
cm_rl_log_reg <- cm_lr$record_level_cm
accuracy_frame <- tibble(
Accuracy=cm_rl_log_reg$Accuracy,
Kappa=cm_rl_log_reg$Kappa,
Precision=cm_rl_log_reg$Precision,
Recall=cm_rl_log_reg$Recall
)

The next stage is to evaluate the XGBoost model. We will use this final evaluation to compare against our baseline model.
Note: in reality this would be compared across many models:
cm_xgb <- ConfusionTableR::binary_class_cm(
#Here you will have to cast to factor type as the tool expects factors
train_labels = as.factor(predictions$xgboost_class_custom),
truth_labels = as.factor(predictions$ThryroidClass),
positive='sick', mode='everything'
)

## [INFO] Building a record level confusion matrix to store in dataset
## [INFO] Build finished and to expose record level cm use the record_level_cm list item
# View the confusion matrix native
cm_xgb$confusion_matrix

## Confusion Matrix and Statistics
##
## Reference
## Prediction negative sick
## negative 246 0
## sick 3 27
##
## Accuracy : 0.9891
## 95% CI : (0.9686, 0.9978)
## No Information Rate : 0.9022
## P-Value [Acc > NIR] : 2.239e-09
##
## Kappa : 0.9413
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 1.00000
## Specificity : 0.98795
## Pos Pred Value : 0.90000
## Neg Pred Value : 1.00000
## Precision : 0.90000
## Recall : 1.00000
## F1 : 0.94737
## Prevalence : 0.09783
## Detection Rate : 0.09783
## Detection Prevalence : 0.10870
## Balanced Accuracy : 0.99398
##
## 'Positive' Class : sick
##
I will now extract the record-level metrics for the XGBoost model and bind them onto the previous accuracy frame to view what the difference is:
# Get record level confusion matrix for the XGBoost model
cm_rl_xgboost <- cm_xgb$record_level_cm
accuracy_frame_xg <- tibble(
Accuracy=cm_rl_xgboost$Accuracy,
Kappa=cm_rl_xgboost$Kappa,
Precision=cm_rl_xgboost$Precision,
Recall=cm_rl_xgboost$Recall
)
# Bind the rows from the previous frame
accuracy_frame <- rbind(accuracy_frame, accuracy_frame_xg)
rm(accuracy_frame_xg)

Comparing the two confusion matrices, we have two different models. In reality we would test multiple models, with multiple hyperparameters and multiple splits.
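For a side-by-side view, a small sketch that labels the rows of the combined accuracy frame (the Model column is added purely for readability):

accuracy_frame <- accuracy_frame %>%
  dplyr::mutate(Model = c('Logistic regression', 'XGBoost')) %>%
  dplyr::select(Model, dplyr::everything())
print(accuracy_frame)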
That is an example of how to rebalance the data and improve on the baseline model. Now I will take a fitted model and deploy it with a new R MLOps package called vetiver.
The steps to deploy a model with vetiver are: 1. version, 2. deploy, and 3. monitor.
The subsections hereunder will show you how to do this.
I will demonstrate how to deploy our original baseline model, as at the time of writing vetiver serialisation of the tuned XGBoost workflow is not supported. The tidymodels team are addressing this and will update their GitHub ticket.
Initialising our vetiver model object:
vet_lr_mod <- vetiver_model(lr_fit, "logistic_regression_model")

The next phase is to store and version our model, so that if it is retrained, a previous version can be extracted to roll back to an earlier model serialisation:
library(pins)
model_board <- board_temp(versioned = TRUE)
model_board %>% vetiver::vetiver_pin_write(vet_lr_mod)

## Creating new version '20221101T130051Z-a5816'
## Writing to pin 'logistic_regression_model'
##
## Create a Model Card for your published model
## • Model Cards provide a framework for transparent, responsible reporting
## • Use the vetiver `.Rmd` template as a place to start
model_board %>% pin_versions("logistic_regression_model")

## # A tibble: 1 × 3
## version created hash
## <chr> <dttm> <chr>
## 1 20221101T130051Z-a5816 2022-11-01 13:00:51 a5816
We will create a RESTful API for the deployment of our logistic regression baseline model with vetiver.
We will use Plumber here, as this allows for quickly deploying web services. See my tutorial on creating a REST API from scratch with Plumber: https://github.com/StatsGary/NHS_R_Community_Intro_to_Docker.
library(plumber)
library(vetiver)
pr() %>%
vetiver_api(vet_lr_mod) # %>%
  # pr_run()

## # Plumber router with 2 endpoints, 4 filters, and 1 sub-router.
## # Use `pr_run()` on this object to start the API.
## ├──[queryString]
## ├──[body]
## ├──[cookieParser]
## ├──[sharedSecret]
## ├──/logo
## │  │ # Plumber static router serving from directory: /Library/Frameworks/R.framework/Versions/4.1/Resources/library/vetiver
## ├──/ping (GET)
## └──/predict (POST)
# Write the plumber file
vetiver_write_plumber(model_board, 'logistic_regression_model')

To deploy a vetiver endpoint to RStudio Connect, simply follow the command below:
#vetiver_deploy_rsconnect(model_board, "logistic_regression_model")

If you are deploying to any other platform, e.g. GCP, AWS, Cloud Run or MS Azure, you would need to create a microservice and store it in the container registry of the relevant cloud provider. I go into how to deploy your app as an endpoint on Docker here: https://www.youtube.com/watch?v=WMCkV_J5a0s.
To generate the Dockerfile needed for container deployment as a Docker microservice, you can use the command below:
vetiver_write_docker(vet_lr_mod)

## The version of R recorded in the lockfile will be updated:
## - R [*] -> [4.1.2]
##
## * Lockfile written to 'vetiver_renv.lock'.
The first thing to do is set up your endpoint:
endpoint <- vetiver_endpoint("http://127.0.0.1:8080/predict")
print(endpoint)

Here the port number (8080) must match the port your Plumber API is running on. In my case port 8080 is open on my API to connect to, and the predict function will allow you to pass requests to and from the endpoint.
Here, we will set up a new patient record whose fields match those of our training set:
# Get the structure of train
str(train)
names(train)
# New patient
prod_patient <- tibble(
patient_age = 40, patient_gender = 1,
presc_thyroxine = 0, queried_why_on_thyroxine = 0,
presc_anthyroid_meds = 1, sick = 0,
pregnant = 1, thyroid_surgery = 1,
radioactive_iodine_therapyI131 = 0, query_hypothyroid = 0,
query_hyperthyroid = 1, lithium = 0, goitre = 0, tumor = 0,
hypopituitarism = 0, psych_condition = 0, TSH_measured = 1,
TSH_reading = 2.0, T3_measured = 1, T3_reading = 2.2,
T4_measured = 1, T4_reading = 85, thyrox_util_rate_T4U_measured = 1,
thyrox_util_rate_T4U_reading = 0.93, FTI_measured = 1,
FTI_reading = 109
)

The step after this would be to predict against our endpoint for the new patient:
predict(endpoint, prod_patient)

This allows you to predict against an active endpoint and simplifies the whole Docker deployment process.
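Finally, the third vetiver step (monitor) is not shown above. As a rough sketch only, assuming vetiver's vetiver_compute_metrics() and vetiver_pin_metrics() helpers, monitoring the deployed model's performance over time might look like this (the date column and metric choice are illustrative assumptions):

# Rough monitoring sketch - assumed vetiver API, illustrative date column
monitored <- predictions %>%
  dplyr::mutate(date = Sys.Date())
metrics_df <- vetiver::vetiver_compute_metrics(
  monitored, date_var = date, period = "week",
  truth = ThryroidClass, estimate = log_reg_model_pred,
  metric_set = yardstick::metric_set(yardstick::accuracy)
)
model_board %>% vetiver::vetiver_pin_metrics(metrics_df, "logistic_regression_model_metrics")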