setwd("D:/Class Materials & Work/Summer 2020 practice/4_Tidymodel_Parameter tuning")
getwd()
## [1] "D:/Class Materials & Work/Summer 2020 practice/4_Tidymodel_Parameter tuning"
In building a model, hyperparameters such as the number of predictors sampled at each split in a tree-based model (mtry) or the learning rate in a boosted tree model (learn_rate) cannot be learned directly from a data set during model training.
Instead, the best values for these hyperparameters are estimated by training many models on resampled data sets and exploring how well each of those models performs. This process is called parameter tuning.
Loading required packages
library(tidymodels) # for the tune package, along with the rest of tidymodels
library(modeldata) # for the cells data
library(vip) # for variable importance plots
#Cell Image Data-Revisited----
Loading the cell data set.
data(cells, package = "modeldata")
cells
## # A tibble: 2,019 x 58
## case class angle_ch_1 area_ch_1 avg_inten_ch_1 avg_inten_ch_2 avg_inten_ch_3
## <fct> <fct> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Test PS 143. 185 15.7 4.95 9.55
## 2 Train PS 134. 819 31.9 207. 69.9
## 3 Train WS 107. 431 28.0 116. 63.9
## 4 Train PS 69.2 298 19.5 102. 28.2
## 5 Test PS 2.89 285 24.3 112. 20.5
## 6 Test WS 40.7 172 326. 654. 129.
## 7 Test WS 174. 177 260. 596. 124.
## 8 Test PS 180. 251 18.3 5.73 17.2
## 9 Test WS 18.9 495 16.1 89.5 13.7
## 10 Test WS 153. 384 17.7 89.9 20.4
## # ... with 2,009 more rows, and 51 more variables: avg_inten_ch_4 <dbl>,
## # convex_hull_area_ratio_ch_1 <dbl>, convex_hull_perim_ratio_ch_1 <dbl>,
## # diff_inten_density_ch_1 <dbl>, diff_inten_density_ch_3 <dbl>,
## # diff_inten_density_ch_4 <dbl>, entropy_inten_ch_1 <dbl>,
## # entropy_inten_ch_3 <dbl>, entropy_inten_ch_4 <dbl>,
## # eq_circ_diam_ch_1 <dbl>, eq_ellipse_lwr_ch_1 <dbl>,
## # eq_ellipse_oblate_vol_ch_1 <dbl>, eq_ellipse_prolate_vol_ch_1 <dbl>,
## # eq_sphere_area_ch_1 <dbl>, eq_sphere_vol_ch_1 <dbl>,
## # fiber_align_2_ch_3 <dbl>, fiber_align_2_ch_4 <dbl>,
## # fiber_length_ch_1 <dbl>, fiber_width_ch_1 <dbl>, inten_cooc_asm_ch_3 <dbl>,
## # inten_cooc_asm_ch_4 <dbl>, inten_cooc_contrast_ch_3 <dbl>,
## # inten_cooc_contrast_ch_4 <dbl>, inten_cooc_entropy_ch_3 <dbl>,
## # inten_cooc_entropy_ch_4 <dbl>, inten_cooc_max_ch_3 <dbl>,
## # inten_cooc_max_ch_4 <dbl>, kurt_inten_ch_1 <dbl>, kurt_inten_ch_3 <dbl>,
## # kurt_inten_ch_4 <dbl>, length_ch_1 <dbl>, neighbor_avg_dist_ch_1 <dbl>,
## # neighbor_min_dist_ch_1 <dbl>, neighbor_var_dist_ch_1 <dbl>,
## # perim_ch_1 <dbl>, shape_bfr_ch_1 <dbl>, shape_lwr_ch_1 <dbl>,
## # shape_p_2_a_ch_1 <dbl>, skew_inten_ch_1 <dbl>, skew_inten_ch_3 <dbl>,
## # skew_inten_ch_4 <dbl>, spot_fiber_count_ch_3 <int>,
## # spot_fiber_count_ch_4 <dbl>, total_inten_ch_1 <int>,
## # total_inten_ch_2 <dbl>, total_inten_ch_3 <int>, total_inten_ch_4 <int>,
## # var_inten_ch_1 <dbl>, var_inten_ch_3 <dbl>, var_inten_ch_4 <dbl>,
## # width_ch_1 <dbl>
In our previous practice, we used a random forest model to classify the quality of image segmentation in cells as either well segmented (WS) or poorly segmented (PS), using 10-fold cross-validation resampling to evaluate its performance.
#Predicting Image Segmentation, but better----
One strength of random forests is that they typically perform well with default hyperparameters. However, other tree-based models, such as boosted trees and decision trees, can be sensitive to the values of their hyperparameters.
In this practice, we will train and tune the hyperparameters of a decision tree model for better performance. Two such hyperparameters are cost_complexity and tree_depth. Tuning them can improve model performance by preventing the decision tree from overfitting, which is when a model fits one particular data set very well at the expense of its ability to predict new data.
Tuning cost_complexity helps by pruning back our tree: it adds a cost (or penalty) to the error rates of more complex trees. A penalty close to zero prunes very little, so the model is more likely to overfit; an extremely high penalty prunes too aggressively, and the model can underfit, the opposite extreme.
Alternatively, or in addition, we can tune tree_depth, which stops the tree from growing beyond a certain depth and so also helps prevent overfitting.
We will tune these two hyperparameters to find the values that give our model the best performance in predicting image segmentation.
As usual, we will split our data into training and testing sets. We will also use strata = class so that both sets preserve the same proportions of the class variable, our outcome of interest, as the original data.
set.seed(123)
cell_split <- initial_split(cells %>% select(-case),
strata = class)
cell_train <- training(cell_split)
cell_test <- testing(cell_split)
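As a quick sanity check (not part of the original analysis), we can verify that the stratified split preserved the class proportions in both sets. This is a hedged sketch using dplyr, which is loaded with tidymodels:
cell_train %>% count(class) %>% mutate(prop = n / sum(n))
cell_test %>% count(class) %>% mutate(prop = n / sum(n))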
We will use the training data for parameter tuning.
#Tuning hyperparameters----
We will create a decision tree model specification with the parsnip package, using the rpart engine, and mark the hyperparameters we plan to tune: cost_complexity and tree_depth.
tune_spec <-
decision_tree(
cost_complexity = tune(),
tree_depth = tune()
) %>%
set_engine("rpart") %>%
set_mode("classification")
tune_spec
## Decision Tree Model Specification (classification)
##
## Main Arguments:
## cost_complexity = tune()
## tree_depth = tune()
##
## Computational engine: rpart
We use tune() to flag the hyperparameters that we plan to tune when specifying the model. Essentially, this lets the program know that we will tune cost_complexity and tree_depth later, rather than fixing their values now.
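If we want to confirm which parameters have been flagged for tuning, one option is to list the parameter set of the specification. A hedged sketch (parameters() comes from the dials package; newer tidymodels releases may prefer an extract_* helper):
parameters(tune_spec)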
We cannot find the optimal values for these hyperparameters by training the model once on the whole training set. Instead, we train many models on resampled data sets and compare how well they perform.
We can create a grid of candidate values for both hyperparameters with the dials package.
tree_grid <- grid_regular(cost_complexity(), #identifying the hyperparameters
tree_depth(),
levels = 5)
grid_regular() chooses sensible values to try for each hyperparameter; here we asked for 5 levels of each.
Since we have two hyperparameters, the function returns 5 x 5 = 25 different combinations to try, in a tidy tibble format.
tree_grid
## # A tibble: 25 x 2
## cost_complexity tree_depth
## <dbl> <int>
## 1 0.0000000001 1
## 2 0.0000000178 1
## 3 0.00000316 1
## 4 0.000562 1
## 5 0.1 1
## 6 0.0000000001 4
## 7 0.0000000178 4
## 8 0.00000316 4
## 9 0.000562 4
## 10 0.1 4
## # ... with 15 more rows
Here, you can see all 5 values of cost_complexity ranging up to 0.1. These values get repeated for each of the 5 values of tree_depth.
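These candidate values come from the default ranges of the dials parameter objects, which we could inspect or override. A hedged sketch (tree_grid_custom is a hypothetical name; note that cost_complexity() ranges are specified on the log10 scale that dials uses):
cost_complexity() # default range is 1e-10 to 0.1, i.e. -10 to -1 on the log10 scale
tree_depth() # default range is 1 to 15
tree_grid_custom <- grid_regular(cost_complexity(range = c(-5, -1)),
                                 tree_depth(range = c(2L, 10L)),
                                 levels = 4)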
tree_grid %>%
count(tree_depth)
## # A tibble: 5 x 2
## tree_depth n
## <int> <int>
## 1 1 5
## 2 4 5
## 3 8 5
## 4 11 5
## 5 15 5
Now that we have 25 candidate combinations of hyperparameter values, we can create the cross-validation folds we will resample over during tuning:
set.seed(234)
cell_folds <- vfold_cv(cell_train)
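By default, vfold_cv() creates 10 folds. If we also wanted each fold to preserve the class proportions, we could stratify the folds as well. A hedged sketch (cell_folds_strat is a hypothetical name, not used below):
cell_folds_strat <- vfold_cv(cell_train, v = 10, strata = class)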
#Model tuning with a grid----
Next, we will fit a model at each of the candidate values we chose for the tuned hyperparameters. We can either tune a model specification together with a recipe or formula, or tune a workflow() that bundles a model specification with a preprocessor (a recipe or formula). For this practice, we choose the latter.
Here we use a workflow() with a straightforward formula; if this model required more involved data preprocessing, we could use add_recipe() instead of add_formula().
set.seed(345)
tree_wf <- workflow() %>%
add_model(tune_spec) %>%
add_formula(class ~ .)
tree_res <-
tree_wf %>%
tune_grid(
resamples = cell_folds,
grid = tree_grid
)
tree_res
## # 10-fold cross-validation
## # A tibble: 10 x 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [1.4K/152]> Fold01 <tibble [50 x 5]> <tibble [0 x 1]>
## 2 <split [1.4K/152]> Fold02 <tibble [50 x 5]> <tibble [0 x 1]>
## 3 <split [1.4K/152]> Fold03 <tibble [50 x 5]> <tibble [0 x 1]>
## 4 <split [1.4K/152]> Fold04 <tibble [50 x 5]> <tibble [0 x 1]>
## 5 <split [1.4K/152]> Fold05 <tibble [50 x 5]> <tibble [0 x 1]>
## 6 <split [1.4K/151]> Fold06 <tibble [50 x 5]> <tibble [0 x 1]>
## 7 <split [1.4K/151]> Fold07 <tibble [50 x 5]> <tibble [0 x 1]>
## 8 <split [1.4K/151]> Fold08 <tibble [50 x 5]> <tibble [0 x 1]>
## 9 <split [1.4K/151]> Fold09 <tibble [50 x 5]> <tibble [0 x 1]>
## 10 <split [1.4K/151]> Fold10 <tibble [50 x 5]> <tibble [0 x 1]>
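By default, tune_grid() computes accuracy and roc_auc for a classification model. If we wanted to choose the metrics explicitly or keep the holdout predictions for later inspection, we could pass extra arguments. A hedged sketch (tree_res_alt is a hypothetical name, not used below):
tree_res_alt <-
  tree_wf %>%
  tune_grid(
    resamples = cell_folds,
    grid = tree_grid,
    metrics = metric_set(roc_auc, accuracy),
    control = control_grid(save_pred = TRUE) # keep the out-of-sample predictions
  )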
Once we have our tuning results, we can explore them through visualization and then select the best result.
The function collect_metrics() gives us a tidy tibble with all the results.
We had 25 candidate models and two metrics, accuracy and roc_auc, so we get a row for each combination of model and .metric.
tree_res %>%
collect_metrics()
## # A tibble: 50 x 7
## cost_complexity tree_depth .metric .estimator mean n std_err
## <dbl> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 1 accuracy binary 0.734 10 0.00877
## 2 0.0000000001 1 roc_auc binary 0.772 10 0.00617
## 3 0.0000000001 4 accuracy binary 0.804 10 0.00696
## 4 0.0000000001 4 roc_auc binary 0.865 10 0.00965
## 5 0.0000000001 8 accuracy binary 0.792 10 0.0116
## 6 0.0000000001 8 roc_auc binary 0.859 10 0.0104
## 7 0.0000000001 11 accuracy binary 0.787 10 0.0134
## 8 0.0000000001 11 roc_auc binary 0.854 10 0.0118
## 9 0.0000000001 15 accuracy binary 0.783 10 0.0129
## 10 0.0000000001 15 roc_auc binary 0.856 10 0.0116
## # ... with 40 more rows
We might get more out of plotting these results:
tree_res %>%
collect_metrics() %>%
mutate(tree_depth = factor(tree_depth)) %>%
ggplot(aes(cost_complexity, mean, color = tree_depth)) +
geom_line(size = 1.5, alpha = 0.6) +
geom_point(size = 2) +
facet_wrap(~ .metric, scales = "free", nrow = 2) +
# transform the x-axis to a log10 scale
scale_x_log10(labels = scales::label_number()) +
scale_color_viridis_d(option = "plasma", begin = .9, end = 0)
From the plot above, the tree with a depth of 1 performs worst on both accuracy and ROC AUC across all values of cost_complexity.
The deeper trees perform better than this shallowest tree, and the best candidate has a depth of 4.
The show_best() function shows us the top 5 candidate models by default:
tree_res %>%
show_best("roc_auc")
## # A tibble: 5 x 7
## cost_complexity tree_depth .metric .estimator mean n std_err
## <dbl> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 4 roc_auc binary 0.865 10 0.00965
## 2 0.0000000178 4 roc_auc binary 0.865 10 0.00965
## 3 0.00000316 4 roc_auc binary 0.865 10 0.00965
## 4 0.000562 4 roc_auc binary 0.865 10 0.00965
## 5 0.0000000001 8 roc_auc binary 0.859 10 0.0104
tree_res %>%
show_best("accuracy")
## # A tibble: 5 x 7
## cost_complexity tree_depth .metric .estimator mean n std_err
## <dbl> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 4 accuracy binary 0.804 10 0.00696
## 2 0.0000000178 4 accuracy binary 0.804 10 0.00696
## 3 0.00000316 4 accuracy binary 0.804 10 0.00696
## 4 0.000562 4 accuracy binary 0.804 10 0.00696
## 5 0.0000000001 8 accuracy binary 0.792 10 0.0116
We can also use the select_best() function to pull out the single set of hyperparameter values for our best decision tree model:
best_tree <- tree_res %>%
select_best(metric = "roc_auc")
best_tree
## # A tibble: 1 x 2
## cost_complexity tree_depth
## <dbl> <int>
## 1 0.0000000001 4
These are the values for tree_depth and cost_complexity that maximize AUC in this data set of cell images.
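Another option worth knowing about (a hedged sketch, not used in the rest of this practice) is select_by_one_std_err(), which picks the simplest candidate whose performance is within one standard error of the numerically best one; sorting by desc(cost_complexity) tells it that more heavily pruned trees are the simpler ones. best_tree_1se is a hypothetical name:
best_tree_1se <- tree_res %>%
  select_by_one_std_err(desc(cost_complexity), metric = "roc_auc")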
#Finalizing our model----
We can update (or “finalize”) our workflow object tree_wf with the values from select_best().
final_wf <-
tree_wf %>%
finalize_workflow(best_tree)
final_wf
## == Workflow ==============================================================================================================
## Preprocessor: Formula
## Model: decision_tree()
##
## -- Preprocessor ----------------------------------------------------------------------------------------------------------
## class ~ .
##
## -- Model -----------------------------------------------------------------------------------------------------------------
## Decision Tree Model Specification (classification)
##
## Main Arguments:
## cost_complexity = 1e-10
## tree_depth = 4
##
## Computational engine: rpart
Our tuning is done!
Let’s explore the result by fitting the finalized workflow, which comprises the model and formula, to the training data:
final_tree <-
final_wf %>%
fit(data = cell_train)
final_tree
## == Workflow [trained] ====================================================================================================
## Preprocessor: Formula
## Model: decision_tree()
##
## -- Preprocessor ----------------------------------------------------------------------------------------------------------
## class ~ .
##
## -- Model -----------------------------------------------------------------------------------------------------------------
## n= 1515
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 1515 540 PS (0.64356436 0.35643564)
## 2) total_inten_ch_2< 47256.5 731 63 PS (0.91381669 0.08618331)
## 4) total_inten_ch_2< 37166 585 19 PS (0.96752137 0.03247863) *
## 5) total_inten_ch_2>=37166 146 44 PS (0.69863014 0.30136986)
## 10) avg_inten_ch_1< 99.15056 98 14 PS (0.85714286 0.14285714) *
## 11) avg_inten_ch_1>=99.15056 48 18 WS (0.37500000 0.62500000)
## 22) fiber_align_2_ch_3>=1.47949 20 8 PS (0.60000000 0.40000000) *
## 23) fiber_align_2_ch_3< 1.47949 28 6 WS (0.21428571 0.78571429) *
## 3) total_inten_ch_2>=47256.5 784 307 WS (0.39158163 0.60841837)
## 6) fiber_width_ch_1< 11.19756 329 137 PS (0.58358663 0.41641337)
## 12) avg_inten_ch_1< 194.4183 254 82 PS (0.67716535 0.32283465) *
## 13) avg_inten_ch_1>=194.4183 75 20 WS (0.26666667 0.73333333)
## 26) total_inten_ch_3>=62458.5 23 9 PS (0.60869565 0.39130435) *
## 27) total_inten_ch_3< 62458.5 52 6 WS (0.11538462 0.88461538) *
## 7) fiber_width_ch_1>=11.19756 455 115 WS (0.25274725 0.74725275)
## 14) shape_p_2_a_ch_1>=1.225676 300 97 WS (0.32333333 0.67666667)
## 28) avg_inten_ch_2>=362.0108 55 23 PS (0.58181818 0.41818182) *
## 29) avg_inten_ch_2< 362.0108 245 65 WS (0.26530612 0.73469388) *
## 15) shape_p_2_a_ch_1< 1.225676 155 18 WS (0.11612903 0.88387097) *
The final workflow has now been fitted, with the trained model object stored inside it. We can use pull_workflow_fit() to extract that model object for investigation. For example, perhaps we would also like to understand what variables are important in this final model. We can use the vip package to estimate variable importance.
final_tree %>%
pull_workflow_fit() %>%
vip()
These are the automated image analysis measurements that are the most important in driving segmentation quality predictions.
Now we are going to test our finalized model on new data.
The function last_fit() fits the finalized model on the full training data set and evaluates it on the testing data.
final_fit <-
final_wf %>%
last_fit(cell_split)
final_fit %>%
collect_metrics()
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.802
## 2 roc_auc binary 0.860
Plot the ROC curve
final_fit %>%
collect_predictions() %>%
roc_curve(class, .pred_PS) %>%
autoplot()
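We can also compute the area under that curve directly from the same predictions, using yardstick's roc_auc(); a small sketch, which should agree with the roc_auc value reported by collect_metrics() above:
final_fit %>%
  collect_predictions() %>%
  roc_auc(class, .pred_PS)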
The performance metrics from the test set indicate that we did not overfit during our tuning procedure.
We can also tune min_n, which sets the minimum number of data points required in a node for it to be split further, as another early-stopping method to prevent overfitting.
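A hedged sketch of what that extended tuning specification might look like (tune_spec_3 and tree_grid_3 are hypothetical names; min_n() comes from dials):
tune_spec_3 <-
  decision_tree(
    cost_complexity = tune(),
    tree_depth = tune(),
    min_n = tune() # minimum number of data points in a node required to split it further
  ) %>%
  set_engine("rpart") %>%
  set_mode("classification")
tree_grid_3 <- grid_regular(cost_complexity(), tree_depth(), min_n(), levels = 3)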