The following steps will show you how to prepare the data.
The dataset for the example will be the Thyroid dataset contained in
the MLDataR package.
td <- MLDataR::thyroid_disease
skim(td)

| Data summary | |
|---|---|
| Name | td |
| Number of rows | 3772 |
| Number of columns | 28 |
| Column type frequency: | |
| character | 2 |
| numeric | 26 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ThryroidClass | 0 | 1 | 4 | 8 | 0 | 2 | 0 |
| ref_src | 0 | 1 | 3 | 5 | 0 | 5 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| patient_age | 1 | 1.00 | 51.63 | 18.98 | 1.00 | 36.00 | 54.00 | 67.00 | 94.00 | ▁▆▆▇▂ |
| patient_gender | 0 | 1.00 | 0.66 | 0.47 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▅▁▁▁▇ |
| presc_thyroxine | 0 | 1.00 | 0.12 | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| queried_why_on_thyroxine | 0 | 1.00 | 0.01 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| presc_anthyroid_meds | 0 | 1.00 | 0.01 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| sick | 0 | 1.00 | 0.04 | 0.19 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| pregnant | 0 | 1.00 | 0.01 | 0.12 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| thyroid_surgery | 0 | 1.00 | 0.01 | 0.12 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| radioactive_iodine_therapyI131 | 0 | 1.00 | 0.02 | 0.12 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| query_hypothyroid | 0 | 1.00 | 0.06 | 0.24 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| query_hyperthyroid | 0 | 1.00 | 0.06 | 0.24 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| lithium | 0 | 1.00 | 0.00 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| goitre | 0 | 1.00 | 0.01 | 0.09 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| tumor | 0 | 1.00 | 0.03 | 0.16 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hypopituitarism | 0 | 1.00 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| psych_condition | 0 | 1.00 | 0.05 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| TSH_measured | 0 | 1.00 | 0.90 | 0.30 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▁▁▇ |
| TSH_reading | 369 | 0.90 | 5.09 | 24.52 | 0.00 | 0.50 | 1.40 | 2.70 | 530.00 | ▇▁▁▁▁ |
| T3_measured | 0 | 1.00 | 0.80 | 0.40 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▂▁▁▁▇ |
| T3_reading | 769 | 0.80 | 2.01 | 0.83 | 0.05 | 1.60 | 2.00 | 2.40 | 10.60 | ▇▅▁▁▁ |
| T4_measured | 0 | 1.00 | 0.94 | 0.24 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▁▁▇ |
| T4_reading | 231 | 0.94 | 108.32 | 35.60 | 2.00 | 88.00 | 103.00 | 124.00 | 430.00 | ▃▇▁▁▁ |
| thyrox_util_rate_T4U_measured | 0 | 1.00 | 0.90 | 0.30 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▁▁▇ |
| thyrox_util_rate_T4U_reading | 387 | 0.90 | 0.99 | 0.20 | 0.25 | 0.88 | 0.98 | 1.08 | 2.32 | ▁▇▂▁▁ |
| FTI_measured | 0 | 1.00 | 0.90 | 0.30 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▁▁▇ |
| FTI_reading | 385 | 0.90 | 110.47 | 33.09 | 2.00 | 93.00 | 107.00 | 124.00 | 395.00 | ▁▇▁▁▁ |
We will remove the missing values for now, but you could impute these with methods such as MICE or mean/mode/median imputation.
td_clean <- td[complete.cases(td),]
dim(td_clean)

## [1] 2751 28
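If you preferred imputation over dropping rows, a minimal sketch using the recipes package might look like the following (step_impute_median and the all_numeric_predictors selector are assumptions about the recipes version in use):

# A minimal sketch: impute missing numeric readings instead of dropping them
impute_rcp <- recipes::recipe(ThryroidClass ~ ., data = td) %>%
  recipes::step_impute_median(recipes::all_numeric_predictors())
td_imputed <- impute_rcp %>% recipes::prep() %>% recipes::juice()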
Next we will view the class distribution of the classification task:
table_class <- table(td_clean$ThryroidClass)
class_imbalance_original <- unclass(prop.table(table_class))[1:2]
print(class_imbalance_original)

##
## negative sick
## 0.92075609 0.07924391
We will do some oversampling of the sick cases later on in this tutorial; left untreated, this level of imbalance would lead to skewed ML models that predict most patients not to have a thyroid issue.
SMOTE (Synthetic Minority Oversampling Technique) is the algorithm we will use for dealing with the imbalance. This method is used to obtain a synthetically class-balanced, or nearly class-balanced, training set, which is then used to train the classifier.
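To illustrate what SMOTE does before we use it inside a recipe, here is a minimal sketch using themis's standalone smote() helper (the helper and its arguments are an assumption; we apply the recipe-based step_smote() later):

# Illustrative only - all predictors must be numeric and the class must be a factor
balanced <- td_clean %>%
  dplyr::select(-ref_src) %>%
  dplyr::mutate(ThryroidClass = as.factor(ThryroidClass)) %>%
  themis::smote(var = "ThryroidClass", k = 5, over_ratio = 1)
prop.table(table(balanced$ThryroidClass))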
The EDA component we will build sources an external function from the functions sub-folder in our project structure. This file defines the histoplotter function, which enables the visualisation of our continuous variables.
# Get continuous variables only
subset <- td_clean %>%
dplyr::select(ThryroidClass, patient_age, TSH_reading, T3_reading,
T4_reading, thyrox_util_rate_T4U_reading,
FTI_reading)
# Bring in external file for visualisations
source('functions/visualisations.R')
# Use plot function
plot <- histoplotter(subset, ThryroidClass,
chart_x_axis_lbl = 'Thyroid Class',
chart_y_axis_lbl = 'Measures',boxplot_color = 'navy',
boxplot_fill = '#89CFF0', box_fill_transparency = 0.2)
# Add extras to plot
plot + ggthemes::theme_solarized() + theme(legend.position = 'none') +
scale_color_manual(values=c('negative' = 'red', 'sick' = 'blue'))

As you can see, we have a number of outliers in the continuous variables. To deal with this we will apply a standardisation method, such as mean centring and scaling, to bring that variability onto a similar scale and reduce the effect of the statistical outliers. Other treatment options could be to remove these observations via anomaly / outlier detection techniques.
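As a sketch of the standardisation idea (these steps are not part of the recipe we build later, but could be added to it), centring and scaling can be expressed with recipes:

# Minimal sketch: centre and scale the continuous measures in one step
standardise_rcp <- recipes::recipe(ThryroidClass ~ ., data = td_clean) %>%
  recipes::step_normalize(recipes::all_numeric_predictors())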
The next set of steps will be used to get the data ready for training the models - we will have a baseline model and compare it against a model known for tearing up tabular data challenges on Kaggle.
Now we will divide the data into training and testing samples (the cross-validation folds created later will act as our validation data):
td_clean <- td_clean %>%
dplyr::mutate(ThryroidClass = as.factor(ThryroidClass)) %>%
dplyr::select(-ref_src) %>%
drop_na()
# Split the dataset
td_split <-initial_split(td_clean,
strata = ThryroidClass,
prop=0.9,
breaks = 4)
train <- training(td_split)
test <- testing(td_split)

Okay, we have the training and testing samples. The testing sample will be used to assess how accurate the model is on held-out data, and it will feed into the evaluation metrics for the model. We will delve into that later on in this tutorial.
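As a quick, illustrative sanity check, the strata argument should keep the class mix similar in both samples:

prop.table(table(train$ThryroidClass))
prop.table(table(test$ThryroidClass))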
Recipes is a way to simplify the feature engineering process. Back in
the old days you had to do each of these steps to the training data
prior to fitting a model, especially when using packages such as
caret. Now, you can speed this process up massively with
the help of the recipes package. Let’s whip up the recipe:
train_rcp <- recipes::recipe(ThryroidClass ~ ., data=train) %>%
themis::step_smote(ThryroidClass, over_ratio = 0.97, neighbors = 3) %>%
step_zv(all_predictors())
# Prep and juice the recipe so we can view this as a separate data frame
training_df <- train_rcp %>%
prep() %>%
juice()

## Warning: `terms_select()` was deprecated in recipes 0.1.17.
## Please use `recipes_eval_select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
# Class imbalance resolved
class_imbalance_after_smote <- unclass(prop.table(table(training_df$ThryroidClass)))[1:2]
print(class_imbalance_after_smote)

##
## negative sick
## 0.5076684 0.4923316
As we applied Synthetic Minority Oversampling (SMOTE), which is a nearest-neighbours method of oversampling, we need to check what has happened to the binary labels (negative or sick):
imbalance_frame <- tibble(class_imbalance_original,
class_imbalance_after_smote)
print(imbalance_frame)## # A tibble: 2 × 2
## class_imbalance_original class_imbalance_after_smote
## <dbl> <dbl>
## 1 0.921 0.508
## 2 0.0792 0.492
This technique is not always successful: depending on the severity of the imbalance, the synthetic representation of the sick class might still leave the overall distribution imbalanced.
In this example I will create a baseline model and compare against one further classifier, for the sake of brevity. However, in ML challenges it is common to try many different classifiers and pit them against each other in the evaluation stages.
The theory is that if a simple linear classifier does a better job than a more complex algorithm, then stick with good old logistic regression. I won't cover the mathematics of logistic regression in depth, but it follows the linear regression equation very closely, with the addition of a logit (log-odds) link function that turns it from a regressor into a classifier.
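To make that concrete, the model estimates the log-odds of the positive (sick) class as a linear combination of the predictors:

log(p / (1 - p)) = b0 + b1*x1 + ... + bk*xk

so the predicted probability is p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk))), which is what gets thresholded to assign a class label.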
Here I use parsnip to specify the logistic regression model and its computational engine:
lr_mod <- parsnip::logistic_reg() %>%
set_engine('glm')
print(lr_mod)

## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
We will use workflows to create the model workflow:
lr_wf <-
workflow() %>%
add_model(lr_mod) %>%
add_recipe(train_rcp)

A workflow simply bundles the model specification and the preprocessing recipe into a single object that can be fitted in one step.
Next, I will kick off the training process:
lr_fit <-
lr_wf %>%
fit(data=train)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
I want to pull the fitted model coefficients into a tibble I can explore. This can be done below:
lr_fitted <- lr_fit %>%
extract_fit_parsnip() %>%
tidy()

I will visualise this via a bar chart to observe my significant features:
lr_fitted_add <- lr_fitted %>%
mutate(Significance = ifelse(p.value < 0.05,
"Significant", "Insignificant")) %>%
arrange(desc(p.value))
#Create a ggplot object to visualise significance
plot <- lr_fitted_add %>%
ggplot(mapping = aes(x=term, y=p.value, fill=Significance)) +
geom_col() + theme(axis.text.x = element_text(
face="bold", color="#0070BA",
size=8, angle=90)
) + labs(y="P value", x="Terms",
title="P value significance chart",
subtitle="A chart to represent the significant variables in the model",
caption="Produced by Gary Hutson")
plotly::ggplotly(plot)

There are many ways to improve model performance, but the three main ways are bagging, boosting and model stacking.
There are specific R packages for two of these - for bagging see baguette and for stacking see stacks. Otherwise, these can be implemented in caret by extracting the fit objects from the workflow.
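As an illustration only (neither package is used further in this tutorial, and the times engine argument is an assumption), a bagged tree could be specified with baguette in the same workflow style:

library(baguette)
# Minimal bagging sketch - times is the assumed number of bootstrap resamples
bagged_mod <- bag_tree() %>%
  set_engine('rpart', times = 25) %>%
  set_mode('classification')
bagged_wf <- workflow() %>%
  add_model(bagged_mod) %>%
  add_recipe(train_rcp)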
Firstly, we are going to repeat the same process as above and then we are going to compare the results that we get from both models to make a decision about which one to push into production.
This time I will hyperparameter-tune the number of trees to grow and the maximum depth of each tree.
For details of the maths underpinning this model, check out Josh Starmer's excellent videos: https://www.youtube.com/watch?v=ZVFeW798-2I.
xgboost_mod <- boost_tree(trees=tune(), tree_depth = tune()) %>%
set_mode('classification') %>%
set_engine('xgboost')

Here, as stated, I will search over a grid of candidate values to find the best parameters to pass to my model:
# Set the selected parameters in the grid
boost_grid <- dials::grid_regular(
trees(), tree_depth(), levels=5 # 5 levels per parameter (5 x 5 = 25 combinations)
)
# Create the resampling method i.e. K Fold Cross Validation
folds <- vfold_cv(train, v=5)

I will now implement the workflow to manage the XGBoost model:
xgboost_wf <- workflow() %>%
add_model(xgboost_mod) %>%
add_recipe(train_rcp)

Once I have this, I can then iterate through the combinations of folds and hyperparameters, as sketched below:
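The tuning call itself is sketched here: tune_grid() takes the workflow, the resamples and the grid, and produces the xgboost_fold object used below (the default classification metrics, accuracy and ROC AUC, are assumed):

# Grid search across the folds and the hyperparameter grid
xgboost_fold <- tune::tune_grid(
  xgboost_wf,
  resamples = folds,
  grid = boost_grid
)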
We will now select the best model:
best_model <- xgboost_fold %>%
#select_best('accuracy')
select_best('roc_auc')

Visualising the results:
xgboost_fold %>%
collect_metrics() %>%
mutate(tree_depth = factor(tree_depth)) %>%
ggplot(aes(trees, mean, color = tree_depth)) +
geom_line(size=1.5, alpha=0.6) +
geom_point(size=2) +
facet_wrap(~ .metric, scales='free', nrow=2) +
scale_x_log10(labels = scales::label_number()) +
scale_color_viridis_d(option='plasma', begin=.9, end =0) + theme_minimal()
### Finalise the workflow and fit best model
I will now finalise my workflow by selecting the best hyperparameters for the job:
final_wf <-
xgboost_wf %>%
finalize_workflow(best_model)
print(final_wf)

## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: boost_tree()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
##
## • step_smote()
## • step_zv()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Boosted Tree Model Specification (classification)
##
## Main Arguments:
## trees = 500
## tree_depth = 11
##
## Computational engine: xgboost
# Final fit of our fold and hyperparameter combination
final_xgboost_fit <-
final_wf %>%
last_fit(td_split)

The final step would be to collect the metrics for evaluation. We will dedicate a separate section to the evaluation of our models:
final_xgboost_fit %>%
collect_metrics()

## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.989 Preprocessor1_Model1
## 2 roc_auc binary 0.995 Preprocessor1_Model1
Next we will look at the workflow fit:
# This extracts the workflow fit
workflow_xgboost_fit <- final_xgboost_fit %>%
extract_workflow()
# This extracts the parsnip model
xgboost_model_fit <- final_xgboost_fit %>%
extract_fit_parsnip()

As we are following on from the XGBoost model building, we will evaluate this model first and then compare it to our baseline model.
The aim here is to check that our predictions match up with our ground truth labels. By default, the class labels are determined by probabilities higher than 0.5; however, we are going to tweak this threshold to only assign the sick label if the probability is greater than 0.7:
# Pass our test data through model
testing_fit_class <- predict(workflow_xgboost_fit, test)
testing_fit_probs <- predict(workflow_xgboost_fit, test, type='prob')
# Bind this on to our test data with the label to compare ground truth vs predicted
predictions<- cbind(test,testing_fit_probs, testing_fit_class) %>%
dplyr::mutate(xgboost_model_pred=.pred_class,
xgboost_model_prob=.pred_sick) %>%
dplyr::select(everything(), -c(.pred_class, .pred_negative)) %>%
dplyr::mutate(xgboost_class_custom = ifelse(xgboost_model_prob >0.7,"sick","negative")) %>%
dplyr::select(-.pred_sick)

We are now going to append the predictions from the baseline model we created earlier to the predictions data frame:
testing_lr_fit_probs <- predict(lr_fit, test, type='prob')
testing_lr_fit_class <- predict(lr_fit, test)
predictions<- cbind(predictions, testing_lr_fit_probs, testing_lr_fit_class)
predictions <- predictions %>%
dplyr::mutate(log_reg_model_pred=.pred_class,
log_reg_model_prob=.pred_sick) %>%
dplyr::select(everything(), -c(.pred_class, .pred_negative)) %>%
dplyr::mutate(log_reg_class_custom = ifelse(log_reg_model_prob >0.7,"sick","negative")) %>%
dplyr::select(-.pred_sick)
# Get a head view of the finalised data
head(predictions)

## ThryroidClass patient_age patient_gender presc_thyroxine
## 3 sick 80 1 0
## 7 negative 71 1 0
## 28 negative 48 0 0
## 36 sick 64 1 0
## 41 negative 65 0 0
## 58 negative 72 0 0
## queried_why_on_thyroxine presc_anthyroid_meds sick pregnant thyroid_surgery
## 3 0 0 0 0 0
## 7 0 0 1 0 0
## 28 1 0 0 0 0
## 36 0 0 0 0 0
## 41 0 0 0 0 0
## 58 0 0 0 0 0
## radioactive_iodine_therapyI131 query_hypothyroid query_hyperthyroid lithium
## 3 0 0 0 0
## 7 0 0 1 0
## 28 0 0 1 0
## 36 0 0 0 0
## 41 0 0 0 0
## 58 0 0 0 0
## goitre tumor hypopituitarism psych_condition TSH_measured TSH_reading
## 3 0 0 0 0 1 2.200
## 7 0 0 0 0 1 0.030
## 28 0 0 0 0 1 5.400
## 36 0 0 0 0 1 0.035
## 41 0 0 0 0 1 14.800
## 58 0 0 0 0 1 4.100
## T3_measured T3_reading T4_measured T4_reading thyrox_util_rate_T4U_measured
## 3 1 0.6 1 80 1
## 7 1 3.8 1 171 1
## 28 1 1.9 1 87 1
## 36 1 1.0 1 103 1
## 41 1 1.5 1 61 1
## 58 1 1.6 1 94 1
## thyrox_util_rate_T4U_reading FTI_measured FTI_reading xgboost_model_pred
## 3 0.70 1 115 sick
## 7 1.13 1 151 negative
## 28 1.00 1 87 negative
## 36 0.85 1 122 sick
## 41 0.85 1 72 negative
## 58 0.92 1 102 negative
## xgboost_model_prob xgboost_class_custom log_reg_model_pred
## 3 9.998764e-01 sick sick
## 7 2.495646e-04 negative negative
## 28 5.078316e-05 negative negative
## 36 9.974521e-01 sick sick
## 41 6.294250e-05 negative negative
## 58 3.025532e-04 negative negative
## log_reg_model_prob log_reg_class_custom
## 3 9.877476e-01 sick
## 7 4.866757e-05 negative
## 28 7.474834e-03 negative
## 36 9.220536e-01 sick
## 41 1.072977e-01 negative
## 58 2.044620e-01 negative
The default caret confusionMatrix function stores everything as printed text and doesn't allow you to work with the individual values from the output. This is the problem the ConfusionTableR package solves: it flattens the confusion matrix into a record-level (single-row) output, so you can easily store the individual metrics as variables, as and when needed.
First, I will evaluate my baseline model using the package:
cm_lr <- ConfusionTableR::binary_class_cm(
#Here you will have to cast to factor type as the tool expects factors
train_labels = as.factor(predictions$log_reg_class_custom),
truth_labels = as.factor(predictions$ThryroidClass),
positive='sick', mode='everything'
)

## [INFO] Building a record level confusion matrix to store in dataset
## [INFO] Build finished and to expose record level cm use the record_level_cm list item
# View the confusion matrix native
cm_lr$confusion_matrix

## Confusion Matrix and Statistics
##
## Reference
## Prediction negative sick
## negative 244 2
## sick 5 25
##
## Accuracy : 0.9746
## 95% CI : (0.9484, 0.9897)
## No Information Rate : 0.9022
## P-Value [Acc > NIR] : 2.344e-06
##
## Kappa : 0.8631
##
## Mcnemar's Test P-Value : 0.4497
##
## Sensitivity : 0.92593
## Specificity : 0.97992
## Pos Pred Value : 0.83333
## Neg Pred Value : 0.99187
## Precision : 0.83333
## Recall : 0.92593
## F1 : 0.87719
## Prevalence : 0.09783
## Detection Rate : 0.09058
## Detection Prevalence : 0.10870
## Balanced Accuracy : 0.95292
##
## 'Positive' Class : sick
##
The baseline model performs pretty well; you can see this is partly the result of fixing our class imbalance. Let's work with the output in a row-wise fashion, as we can extract some metrics we may be interested in:
# Get record level confusion matrix for logistic regression model
cm_rl_log_reg <- cm_lr$record_level_cm
accuracy_frame <- tibble(
Accuracy=cm_rl_log_reg$Accuracy,
Kappa=cm_rl_log_reg$Kappa,
Precision=cm_rl_log_reg$Precision,
Recall=cm_rl_log_reg$Recall
)

The next stage is to evaluate the XGBoost model. We will use this final evaluation to compare against our baseline model.
Note: in reality this would be compared across many models:
cm_xgb <- ConfusionTableR::binary_class_cm(
#Here you will have to cast to factor type as the tool expects factors
train_labels = as.factor(predictions$xgboost_class_custom),
truth_labels = as.factor(predictions$ThryroidClass),
positive='sick', mode='everything'
)

## [INFO] Building a record level confusion matrix to store in dataset
## [INFO] Build finished and to expose record level cm use the record_level_cm list item
# View the confusion matrix native
cm_xgb$confusion_matrix

## Confusion Matrix and Statistics
##
## Reference
## Prediction negative sick
## negative 246 0
## sick 3 27
##
## Accuracy : 0.9891
## 95% CI : (0.9686, 0.9978)
## No Information Rate : 0.9022
## P-Value [Acc > NIR] : 2.239e-09
##
## Kappa : 0.9413
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 1.00000
## Specificity : 0.98795
## Pos Pred Value : 0.90000
## Neg Pred Value : 1.00000
## Precision : 0.90000
## Recall : 1.00000
## F1 : 0.94737
## Prevalence : 0.09783
## Detection Rate : 0.09783
## Detection Prevalence : 0.10870
## Balanced Accuracy : 0.99398
##
## 'Positive' Class : sick
##
I will now extract the record-level metrics for the XGBoost model and bind them onto the previous accuracy frame to view what the difference is:
# Get record level confusion matrix for the XGBoost model
cm_rl_xgboost <- cm_xgb$record_level_cm
accuracy_frame_xg <- tibble(
Accuracy=cm_rl_xgboost$Accuracy,
Kappa=cm_rl_xgboost$Kappa,
Precision=cm_rl_xgboost$Precision,
Recall=cm_rl_xgboost$Recall
)
# Bind the rows from the previous frame
accuracy_frame <- rbind(accuracy_frame, accuracy_frame_xg)
rm(accuracy_frame_xg)

Comparing the two confusion matrices, we have two different models. In reality we would test multiple models, with multiple hyperparameters and multiple splits.
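For a side-by-side view, a small sketch that labels the rows of the combined accuracy frame (the Model column is added purely for readability):

accuracy_frame <- accuracy_frame %>%
  dplyr::mutate(Model = c('Logistic regression', 'XGBoost')) %>%
  dplyr::select(Model, dplyr::everything())
print(accuracy_frame)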
That is an example of how to rebalance the data and improve on the baseline model. Now I will take a fitted model and deploy it with a new R MLOps package called vetiver.
The steps to deploy a model with vetiver are: 1. version, 2. deploy, and 3. monitor.
The subsections hereunder will show you how to do this.
I will demonstrate how to deploy our original baseline model, as at the time of writing vetiver serialisation of the tuned XGBoost workflow is not supported. The tidymodels team are addressing this and will update their GitHub ticket.
Initialising our vetiver model object:
vet_lr_mod <- vetiver_model(lr_fit, "logistic_regression_model")

The next phase is to store and version our model, so that if it is retrained, a previous version can be extracted to roll back to an earlier model serialisation:
library(pins)
model_board <- board_temp(versioned = TRUE)
model_board %>% vetiver::vetiver_pin_write(vet_lr_mod)

## Creating new version '20221101T130051Z-a5816'
## Writing to pin 'logistic_regression_model'
##
## Create a Model Card for your published model
## • Model Cards provide a framework for transparent, responsible reporting
## • Use the vetiver `.Rmd` template as a place to start
model_board %>% pin_versions("logistic_regression_model")

## # A tibble: 1 × 3
## version created hash
## <chr> <dttm> <chr>
## 1 20221101T130051Z-a5816 2022-11-01 13:00:51 a5816
We will create a RESTful API for the deployment of our logistic regression baseline model with vetiver.
We will use Plumber here, as this allows for quickly deploying web services. See my tutorial on creating a REST API from scratch with Plumber: https://github.com/StatsGary/NHS_R_Community_Intro_to_Docker.
library(plumber)
library(vetiver)
pr() %>%
vetiver_api(vet_lr_mod) # %>%
  # pr_run()

## # Plumber router with 2 endpoints, 4 filters, and 1 sub-router.
## # Use `pr_run()` on this object to start the API.
## ├──[queryString]
## ├──[body]
## ├──[cookieParser]
## ├──[sharedSecret]
## ├──/logo
## │  │ # Plumber static router serving from directory: /Library/Frameworks/R.framework/Versions/4.1/Resources/library/vetiver
## ├──/ping (GET)
## └──/predict (POST)
# Write the plumber file
vetiver_write_plumber(model_board, 'logistic_regression_model')

To deploy a vetiver endpoint to RStudio Connect, simply follow the command below:
#vetiver_deploy_rsconnect(model_board, "logistic_regression_model")

If you are deploying to any other platform, e.g. GCP, AWS, Cloud Run or MS Azure, you would need to create a microservice and store it in the container registry of the relevant cloud provider. I go into how to deploy your app as an endpoint on Docker here: https://www.youtube.com/watch?v=WMCkV_J5a0s.
To generate the Dockerfile needed for container deployment as a Docker microservice, you can use the command below:
vetiver_write_docker(vet_lr_mod)

## The version of R recorded in the lockfile will be updated:
## - R [*] -> [4.1.2]
##
## * Lockfile written to 'vetiver_renv.lock'.
The first thing to do is set up your endpoint:
endpoint <- vetiver_endpoint("http://127.0.0.1:8080/predict")
print(endpoint)

Here the port number (8080) must match the port your Plumber API is running on. In my case port 8080 is open on my API to connect to, and the predict function will allow you to pass requests to and from the endpoint.
Here, we will set up a new patient record whose fields match those of our training set:
# Get the structure of train
str(train)
names(train)
# New patient
prod_patient <- tibble(
patient_age = 40, patient_gender = 1,
presc_thyroxine = 0, queried_why_on_thyroxine = 0,
presc_anthyroid_meds = 1, sick = 0,
pregnant = 1, thyroid_surgery = 1,
radioactive_iodine_therapyI131 = 0, query_hypothyroid = 0,
query_hyperthyroid = 1, lithium = 0, goitre = 0, tumor = 0,
hypopituitarism = 0, psych_condition = 0, TSH_measured = 1,
TSH_reading = 2.0, T3_measured = 1, T3_reading = 2.2,
T4_measured = 1, T4_reading = 85, thyrox_util_rate_T4U_measured = 1,
thyrox_util_rate_T4U_reading = 0.93, FTI_measured = 1,
FTI_reading = 109
)

The step after this would be to predict against our endpoint for the new patient:
predict(endpoint, prod_patient)

This allows you to predict against an active endpoint and simplifies the whole Docker deployment process.
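Finally, the third vetiver step (monitor) is not shown above. As a rough sketch only, assuming vetiver's vetiver_compute_metrics() and vetiver_pin_metrics() helpers, monitoring the deployed model's performance over time might look like this (the date column and metric choice are illustrative assumptions):

# Rough monitoring sketch - assumed vetiver API, illustrative date column
monitored <- predictions %>%
  dplyr::mutate(date = Sys.Date())
metrics_df <- vetiver::vetiver_compute_metrics(
  monitored, date_var = date, period = "week",
  truth = ThryroidClass, estimate = log_reg_model_pred,
  metric_set = yardstick::metric_set(yardstick::accuracy)
)
model_board %>% vetiver::vetiver_pin_metrics(metrics_df, "logistic_regression_model_metrics")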