setwd("D:/Class Materials & Work/Summer 2020 practice/4_Tidymodel_Parameter tuning")
getwd()
## [1] "D:/Class Materials & Work/Summer 2020 practice/4_Tidymodel_Parameter tuning"

In building a model, hyperparameters such as the number of predictors sampled at each split in a tree-based model (mtry) or the learning rate in a boosted tree model (learn_rate) cannot be learned directly from a data set during model training.

Instead, the best values for these hyperparameters are estimated by training many models on resampled versions of the data set and exploring how well all of those models perform. This process is called parameter tuning.
Loading required packages

library(tidymodels)  # for the tune package, along with the rest of tidymodels
library(modeldata)   # for the cells data
library(vip)         # for variable importance plots

#Cell Image Data-Revisited----

Loading the cell data set.

data(cells, package = "modeldata")
cells
## # A tibble: 2,019 x 58
##    case  class angle_ch_1 area_ch_1 avg_inten_ch_1 avg_inten_ch_2 avg_inten_ch_3
##    <fct> <fct>      <dbl>     <int>          <dbl>          <dbl>          <dbl>
##  1 Test  PS        143.         185           15.7           4.95           9.55
##  2 Train PS        134.         819           31.9         207.            69.9 
##  3 Train WS        107.         431           28.0         116.            63.9 
##  4 Train PS         69.2        298           19.5         102.            28.2 
##  5 Test  PS          2.89       285           24.3         112.            20.5 
##  6 Test  WS         40.7        172          326.          654.           129.  
##  7 Test  WS        174.         177          260.          596.           124.  
##  8 Test  PS        180.         251           18.3           5.73          17.2 
##  9 Test  WS         18.9        495           16.1          89.5           13.7 
## 10 Test  WS        153.         384           17.7          89.9           20.4 
## # ... with 2,009 more rows, and 51 more variables: avg_inten_ch_4 <dbl>,
## #   convex_hull_area_ratio_ch_1 <dbl>, convex_hull_perim_ratio_ch_1 <dbl>,
## #   diff_inten_density_ch_1 <dbl>, diff_inten_density_ch_3 <dbl>,
## #   diff_inten_density_ch_4 <dbl>, entropy_inten_ch_1 <dbl>,
## #   entropy_inten_ch_3 <dbl>, entropy_inten_ch_4 <dbl>,
## #   eq_circ_diam_ch_1 <dbl>, eq_ellipse_lwr_ch_1 <dbl>,
## #   eq_ellipse_oblate_vol_ch_1 <dbl>, eq_ellipse_prolate_vol_ch_1 <dbl>,
## #   eq_sphere_area_ch_1 <dbl>, eq_sphere_vol_ch_1 <dbl>,
## #   fiber_align_2_ch_3 <dbl>, fiber_align_2_ch_4 <dbl>,
## #   fiber_length_ch_1 <dbl>, fiber_width_ch_1 <dbl>, inten_cooc_asm_ch_3 <dbl>,
## #   inten_cooc_asm_ch_4 <dbl>, inten_cooc_contrast_ch_3 <dbl>,
## #   inten_cooc_contrast_ch_4 <dbl>, inten_cooc_entropy_ch_3 <dbl>,
## #   inten_cooc_entropy_ch_4 <dbl>, inten_cooc_max_ch_3 <dbl>,
## #   inten_cooc_max_ch_4 <dbl>, kurt_inten_ch_1 <dbl>, kurt_inten_ch_3 <dbl>,
## #   kurt_inten_ch_4 <dbl>, length_ch_1 <dbl>, neighbor_avg_dist_ch_1 <dbl>,
## #   neighbor_min_dist_ch_1 <dbl>, neighbor_var_dist_ch_1 <dbl>,
## #   perim_ch_1 <dbl>, shape_bfr_ch_1 <dbl>, shape_lwr_ch_1 <dbl>,
## #   shape_p_2_a_ch_1 <dbl>, skew_inten_ch_1 <dbl>, skew_inten_ch_3 <dbl>,
## #   skew_inten_ch_4 <dbl>, spot_fiber_count_ch_3 <int>,
## #   spot_fiber_count_ch_4 <dbl>, total_inten_ch_1 <int>,
## #   total_inten_ch_2 <dbl>, total_inten_ch_3 <int>, total_inten_ch_4 <int>,
## #   var_inten_ch_1 <dbl>, var_inten_ch_3 <dbl>, var_inten_ch_4 <dbl>,
## #   width_ch_1 <dbl>

In our previous practice, we used a random forest model to classify the quality of image segmentation in cells as either well segmented (WS) or poorly segmented (PS), and we evaluated its performance with 10-fold cross-validation.

#Predicting Image Segmentation, but better----

Random forest models typically perform well with their default hyperparameters and need little tuning. However, other tree-based models, such as boosted trees and decision trees, can be sensitive to their hyperparameter values.
In this practice, we will train a decision tree model and tune its hyperparameters for better performance. Examples of such hyperparameters are

  1. the complexity parameter for the tree (cost_complexity), and
  2. the maximum tree depth (tree_depth).

Tuning these hyperparameters can improve model performance by preventing decision tree models from overfitting, which happens when a model fits one particular data set very well at the expense of its ability to predict new data.

Tuning the value of cost_complexity helps by pruning back our tree: it adds a cost, or penalty, to the error rates of more complex trees. A penalty close to zero prunes very few sub-trees, so the model is more likely to overfit; a very high penalty prunes aggressively and can leave the model underfit, the opposite extreme.

An alternative, or addition, to tuning cost_complexity is tuning tree_depth, which stops the tree from growing once it reaches a certain depth and so also helps prevent overfitting.

We will tune both hyperparameters to find the combination that gives our model the best performance in predicting image segmentation.
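
To make the two hyperparameters concrete before we tune them, here is a minimal sketch of a decision tree specification with fixed, arbitrarily chosen values (for illustration only; these are not the values we will end up using):

#a sketch with arbitrary fixed values, just to show where the hyperparameters live
decision_tree(cost_complexity = 0.01, tree_depth = 5) %>% 
  set_engine("rpart") %>% 
  set_mode("classification")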


As usual, we will split our data into training and testing sets. We will also use strata = class so that both sets have the same proportions of the class variable, our outcome of interest, as the original data set.

set.seed(123)
cell_split <- initial_split(cells %>% select(-case), 
                            strata = class)
cell_train <- training(cell_split)
cell_test  <- testing(cell_split)

We will use the training data for parameter tuning.
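
To confirm that the stratified split kept the class proportions similar in both sets, a quick check like this can help (output not shown here):

#class proportions in the training and testing sets should be roughly equal
cell_train %>% count(class) %>% mutate(prop = n / sum(n))
cell_test  %>% count(class) %>% mutate(prop = n / sum(n))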


#Tuning hyperparameters----

We will create a decision tree model specification with the parsnip package, using the rpart engine, and mark the hyperparameters we plan to tune: cost_complexity and tree_depth.

tune_spec <- 
  decision_tree(
    cost_complexity = tune(),
    tree_depth = tune()
  ) %>% 
  set_engine("rpart") %>% 
  set_mode("classification")

tune_spec
## Decision Tree Model Specification (classification)
## 
## Main Arguments:
##   cost_complexity = tune()
##   tree_depth = tune()
## 
## Computational engine: rpart

We use tune() to flag the hyperparameters we plan to tune when specifying the model. Think of it as a placeholder that tells tidymodels we will supply values for cost_complexity and tree_depth later, during tuning.
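
If we want to double-check which arguments have been flagged for tuning, one option is to list the tuning parameters of the specification (a quick check, assuming your installed version of tune/dials still exposes the parameters() helper for model specifications):

#list the parameters marked with tune(); this should report cost_complexity and tree_depth
parameters(tune_spec)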

We cannot find the optimal values for both hyperparameters by training once on the whole training set, but we can estimate them by training many models on resampled data and seeing which values perform best.
We can create a grid of candidate values for both hyperparameters with the dials package.

tree_grid <- grid_regular(cost_complexity(), #identifying the hyperparameters
                          tree_depth(),
                          levels = 5)

grid_regular() chooses sensible values to try for each hyperparameter; we asked for 5 of each.
Since we have two hyperparameters, the function returns 5 x 5 = 25 different candidate combinations in a tidy tibble.

tree_grid  
## # A tibble: 25 x 2
##    cost_complexity tree_depth
##              <dbl>      <int>
##  1    0.0000000001          1
##  2    0.0000000178          1
##  3    0.00000316            1
##  4    0.000562              1
##  5    0.1                   1
##  6    0.0000000001          4
##  7    0.0000000178          4
##  8    0.00000316            4
##  9    0.000562              4
## 10    0.1                   4
## # ... with 15 more rows

Here, you can see all 5 values of cost_complexity ranging up to 0.1. These values get repeated for each of the 5 values of tree_depth.

tree_grid %>% 
  count(tree_depth)
## # A tibble: 5 x 2
##   tree_depth     n
##        <int> <int>
## 1          1     5
## 2          4     5
## 3          8     5
## 4         11     5
## 5         15     5
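
A regular grid is not the only option; dials can also draw random combinations of values. A minimal sketch of this alternative (not used in this practice):

#an alternative grid: 25 randomly sampled candidate combinations
set.seed(123)
grid_random(cost_complexity(),
            tree_depth(),
            size = 25)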

Now that we have a grid of 25 candidate decision tree models, let's create the cross-validation folds we will use for tuning:

set.seed(234)
cell_folds <- vfold_cv(cell_train)
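
vfold_cv() defaults to 10 folds. If we wanted the folds themselves to preserve the class proportions, we could also stratify them, mirroring the initial split (a small variation, not used below):

#a variation: 10 cross-validation folds stratified by the outcome
cell_folds_strat <- vfold_cv(cell_train, v = 10, strata = class)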

#Model tuning with a grid----

Next, we will fit a model at each of the different hyperparameter values we chose. We can tune a model specification together with a recipe or formula, or we can tune a workflow() that bundles the model specification and the preprocessor. For this practice, we choose the latter.
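
For reference, the first option (tuning the model specification directly with a formula) would look roughly like this; a sketch we do not run here:

#the non-workflow route: pass the model specification and a formula to tune_grid()
tree_res_alt <- 
  tune_spec %>% 
  tune_grid(
    class ~ .,
    resamples = cell_folds,
    grid = tree_grid
  )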

Here we use a workflow() with a straightforward formula; if this model required more involved data preprocessing, we could use add_recipe() instead of add_formula().

set.seed(345)

tree_wf <- workflow() %>%
  add_model(tune_spec) %>%
  add_formula(class ~ .)

tree_res <- 
  tree_wf %>% 
  tune_grid(
    resamples = cell_folds,
    grid = tree_grid
  )

tree_res 
## #  10-fold cross-validation 
## # A tibble: 10 x 4
##    splits             id     .metrics          .notes          
##    <list>             <chr>  <list>            <list>          
##  1 <split [1.4K/152]> Fold01 <tibble [50 x 5]> <tibble [0 x 1]>
##  2 <split [1.4K/152]> Fold02 <tibble [50 x 5]> <tibble [0 x 1]>
##  3 <split [1.4K/152]> Fold03 <tibble [50 x 5]> <tibble [0 x 1]>
##  4 <split [1.4K/152]> Fold04 <tibble [50 x 5]> <tibble [0 x 1]>
##  5 <split [1.4K/152]> Fold05 <tibble [50 x 5]> <tibble [0 x 1]>
##  6 <split [1.4K/151]> Fold06 <tibble [50 x 5]> <tibble [0 x 1]>
##  7 <split [1.4K/151]> Fold07 <tibble [50 x 5]> <tibble [0 x 1]>
##  8 <split [1.4K/151]> Fold08 <tibble [50 x 5]> <tibble [0 x 1]>
##  9 <split [1.4K/151]> Fold09 <tibble [50 x 5]> <tibble [0 x 1]>
## 10 <split [1.4K/151]> Fold10 <tibble [50 x 5]> <tibble [0 x 1]>

Once we have our tuning results, we can explore them through visualization and then select the best candidate.

The function collect_metrics() gives us a tidy tibble with all the results.
We had 25 candidate models and two metrics, accuracy and roc_auc, so we get one row per metric per candidate model: 50 rows in total.

tree_res %>% 
  collect_metrics()
## # A tibble: 50 x 7
##    cost_complexity tree_depth .metric  .estimator  mean     n std_err
##              <dbl>      <int> <chr>    <chr>      <dbl> <int>   <dbl>
##  1    0.0000000001          1 accuracy binary     0.734    10 0.00877
##  2    0.0000000001          1 roc_auc  binary     0.772    10 0.00617
##  3    0.0000000001          4 accuracy binary     0.804    10 0.00696
##  4    0.0000000001          4 roc_auc  binary     0.865    10 0.00965
##  5    0.0000000001          8 accuracy binary     0.792    10 0.0116 
##  6    0.0000000001          8 roc_auc  binary     0.859    10 0.0104 
##  7    0.0000000001         11 accuracy binary     0.787    10 0.0134 
##  8    0.0000000001         11 roc_auc  binary     0.854    10 0.0118 
##  9    0.0000000001         15 accuracy binary     0.783    10 0.0129 
## 10    0.0000000001         15 roc_auc  binary     0.856    10 0.0116 
## # ... with 40 more rows
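
Because these results come back as an ordinary tibble, the usual dplyr verbs work on them. For example, to look at only the ROC AUC results ordered from best to worst (a quick illustration):

#keep only the roc_auc rows and sort by the mean estimate
tree_res %>% 
  collect_metrics() %>% 
  filter(.metric == "roc_auc") %>% 
  arrange(desc(mean))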

We might get more out of plotting these results:

tree_res %>%
  collect_metrics() %>%
  mutate(tree_depth = factor(tree_depth)) %>%
  ggplot(aes(cost_complexity, mean, color = tree_depth)) +
  geom_line(size = 1.5, alpha = 0.6) +
  geom_point(size = 2) +
  facet_wrap(~ .metric, scales = "free", nrow = 2) +
  #Transform the x-axis to a log10 scale
  scale_x_log10(labels = scales::label_number()) +
  scale_color_viridis_d(option = "plasma", begin = .9, end = 0)

From the plot above, the shallowest tree (depth of 1) performs worst on both accuracy and ROC AUC across all values of cost_complexity.
Deeper trees do much better, and the best candidate is the tree with a depth of 4.

The show_best() function shows us the top 5 candidate models by default:

tree_res %>%
  show_best("roc_auc")
## # A tibble: 5 x 7
##   cost_complexity tree_depth .metric .estimator  mean     n std_err
##             <dbl>      <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    0.0000000001          4 roc_auc binary     0.865    10 0.00965
## 2    0.0000000178          4 roc_auc binary     0.865    10 0.00965
## 3    0.00000316            4 roc_auc binary     0.865    10 0.00965
## 4    0.000562              4 roc_auc binary     0.865    10 0.00965
## 5    0.0000000001          8 roc_auc binary     0.859    10 0.0104
tree_res %>%
  show_best("accuracy")
## # A tibble: 5 x 7
##   cost_complexity tree_depth .metric  .estimator  mean     n std_err
##             <dbl>      <int> <chr>    <chr>      <dbl> <int>   <dbl>
## 1    0.0000000001          4 accuracy binary     0.804    10 0.00696
## 2    0.0000000178          4 accuracy binary     0.804    10 0.00696
## 3    0.00000316            4 accuracy binary     0.804    10 0.00696
## 4    0.000562              4 accuracy binary     0.804    10 0.00696
## 5    0.0000000001          8 accuracy binary     0.792    10 0.0116

We can also use the select_best() function to pull out the single set of hyperparameter values for our best decision tree model:

best_tree <- tree_res %>%
  select_best(metric = "roc_auc")

best_tree
## # A tibble: 1 x 2
##   cost_complexity tree_depth
##             <dbl>      <int>
## 1    0.0000000001          4

These are the values for tree_depth and cost_complexity that maximize AUC in this data set of cell images.
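
If we preferred a simpler tree whose performance is still within one standard error of the numerically best one, recent versions of tune also provide select_by_one_std_err(); a hedged sketch, assuming your installed tune version includes this helper:

#pick the simplest candidate within one standard error of the best AUC;
#sorting by tree_depth (ascending) treats shallower trees as simpler
tree_res %>% 
  select_by_one_std_err(tree_depth, metric = "roc_auc")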


#Finalizing our model----

We can update (or “finalize”) our workflow object tree_wf with the values from select_best().

final_wf <- 
  tree_wf %>% 
  finalize_workflow(best_tree)

final_wf
## == Workflow ==============================================================================================================
## Preprocessor: Formula
## Model: decision_tree()
## 
## -- Preprocessor ----------------------------------------------------------------------------------------------------------
## class ~ .
## 
## -- Model -----------------------------------------------------------------------------------------------------------------
## Decision Tree Model Specification (classification)
## 
## Main Arguments:
##   cost_complexity = 1e-10
##   tree_depth = 4
## 
## Computational engine: rpart

Our tuning is done!

Let’s explore the result by fitting the finalized workflow, which comprises the model and the formula, to the training data:

final_tree <- 
  final_wf %>%
  fit(data = cell_train) 

final_tree
## == Workflow [trained] ====================================================================================================
## Preprocessor: Formula
## Model: decision_tree()
## 
## -- Preprocessor ----------------------------------------------------------------------------------------------------------
## class ~ .
## 
## -- Model -----------------------------------------------------------------------------------------------------------------
## n= 1515 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 1515 540 PS (0.64356436 0.35643564)  
##    2) total_inten_ch_2< 47256.5 731  63 PS (0.91381669 0.08618331)  
##      4) total_inten_ch_2< 37166 585  19 PS (0.96752137 0.03247863) *
##      5) total_inten_ch_2>=37166 146  44 PS (0.69863014 0.30136986)  
##       10) avg_inten_ch_1< 99.15056 98  14 PS (0.85714286 0.14285714) *
##       11) avg_inten_ch_1>=99.15056 48  18 WS (0.37500000 0.62500000)  
##         22) fiber_align_2_ch_3>=1.47949 20   8 PS (0.60000000 0.40000000) *
##         23) fiber_align_2_ch_3< 1.47949 28   6 WS (0.21428571 0.78571429) *
##    3) total_inten_ch_2>=47256.5 784 307 WS (0.39158163 0.60841837)  
##      6) fiber_width_ch_1< 11.19756 329 137 PS (0.58358663 0.41641337)  
##       12) avg_inten_ch_1< 194.4183 254  82 PS (0.67716535 0.32283465) *
##       13) avg_inten_ch_1>=194.4183 75  20 WS (0.26666667 0.73333333)  
##         26) total_inten_ch_3>=62458.5 23   9 PS (0.60869565 0.39130435) *
##         27) total_inten_ch_3< 62458.5 52   6 WS (0.11538462 0.88461538) *
##      7) fiber_width_ch_1>=11.19756 455 115 WS (0.25274725 0.74725275)  
##       14) shape_p_2_a_ch_1>=1.225676 300  97 WS (0.32333333 0.67666667)  
##         28) avg_inten_ch_2>=362.0108 55  23 PS (0.58181818 0.41818182) *
##         29) avg_inten_ch_2< 362.0108 245  65 WS (0.26530612 0.73469388) *
##       15) shape_p_2_a_ch_1< 1.225676 155  18 WS (0.11612903 0.88387097) *

The workflow has now been finalized and fitted, with the trained model object inside. We can use pull_workflow_fit() to extract that model object for further investigation. For example, perhaps we would like to understand which variables are most important in this final model; we can use the vip package to estimate variable importance.

final_tree %>% 
  pull_workflow_fit() %>% 
  vip()

These are the automated image analysis measurements that are the most important in driving segmentation quality predictions.
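
If we also want to look at the structure of the fitted tree itself, the underlying rpart object can be plotted with the rpart.plot package (a sketch, assuming rpart.plot is installed):

#extract the raw rpart model inside the parsnip fit and draw the tree
library(rpart.plot)
final_tree %>% 
  pull_workflow_fit() %>% 
  pluck("fit") %>%                #the underlying rpart object
  rpart.plot(roundint = FALSE)    #roundint = FALSE avoids a warning about retrieving the training data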

Now we are going to test our finalized model on new data: the test set.

The function last_fit() fits the finalized workflow on the full training set and evaluates it on the testing data.

final_fit <- 
  final_wf %>%
  last_fit(cell_split) 

final_fit %>%
  collect_metrics()
## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.802
## 2 roc_auc  binary         0.860

Let's plot the ROC curve:

final_fit %>%
  collect_predictions() %>% 
  roc_curve(class, .pred_PS) %>% 
  autoplot()

The performance metrics from the test set indicate that we did not overfit during our tuning procedure.
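
Beyond the ROC curve, we can inspect other aspects of the test-set predictions; for example, a confusion matrix with yardstick (a quick illustration):

#confusion matrix of predicted vs. true segmentation class on the test set
final_fit %>% 
  collect_predictions() %>% 
  conf_mat(truth = class, estimate = .pred_class)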

We could also tune min_n, which sets the minimum number of data points a node must contain before it can be split further, as another early stopping method to guard against overfitting.
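
A minimal sketch of how that could look, extending the specification and grid from above (illustration only; we have not run this here):

#add min_n as a third tuned hyperparameter
tune_spec_min_n <- 
  decision_tree(
    cost_complexity = tune(),
    tree_depth = tune(),
    min_n = tune()
  ) %>% 
  set_engine("rpart") %>% 
  set_mode("classification")

#a smaller regular grid so the number of candidates stays manageable (3 x 3 x 3 = 27)
tree_grid_min_n <- grid_regular(cost_complexity(),
                                tree_depth(),
                                min_n(),
                                levels = 3)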