Even after feature engineering, machine learning approaches can often (but not always) be improved by choosing a more sophisticated model type. Note how we used a regression model in the first two case studies; here, we explore a considerably more sophisticated model, a random forest.
Choosing a more sophisticated model adds some complexity to the modeling process. Notably, more complex models have tuning parameters (also called hyperparameters) - parts of the model that are not estimated directly from the data and instead must be set before fitting. In addition to using feature engineering much as we did in the last case study, Bertolini et al. (2021) use tuning parameters to improve the performance of their predictive model.
Bertolini, R., Finch, S. J., & Nehm, R. H. (2021). Enhancing data pipelines for forecasting student performance: integrating feature selection with cross-validation. International Journal of Educational Technology in Higher Education, 18(1), 1-23. https://github.com/laser-institute/essential-readings/blob/main/machine-learning/ml-lab-3/bertolini-et-al-2021-ijethe.pdf
Our driving question is: How much of a difference does a more complex model make? Looking back to our predictive model from LL1, we can see that our accuracy was around 87%: 0.872, more specifically. Can we improve on that?
While answering this question, we focus not only on estimating a complex model, but also on tuning it. The data is, again, from the #NGSSchat community on Twitter, which lets us compare the performance of this tuned, more complex model to the initial model we used in the first case study.
First, let’s load the packages we’ll use—the familiar {tidyverse} and several others focused on modeling. Like in earlier learning labs, click the green arrow to run the code chunk.
Please add to the chunk below code to load three packages we’ve used in both LL1 and LL2 - tidyverse, tidymodels, and here.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.5 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.2.0 ──
## ✔ broom 0.7.12 ✔ rsample 1.0.0
## ✔ dials 1.0.0 ✔ tune 1.0.0
## ✔ infer 1.0.2 ✔ workflows 1.0.0
## ✔ modeldata 1.0.0 ✔ workflowsets 0.2.1
## ✔ parsnip 1.0.0 ✔ yardstick 1.0.0
## ✔ recipes 1.0.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
library(vip) # a new package we're adding for variable importance measures
##
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
##
## vi
library(ranger) # this is needed for the random forest algorithm
library(here)
## here() starts at /Users/meinazhu/Documents/GitHub/machine-learning
Next, we’ll load the processed data.
Note: We created a means of visualizing the threads to make coding them easier; that’s here and it provides a means of seeing what the raw data is like: https://jmichaelrosenberg.shinyapps.io/ngsschat-shiny/.
We’ve added three additional variables for this analysis; thus, the variables we can consider using as features are:
n: the number of tweets in the thread (independent variable)
mean_favorite_count: the mean number of favorites for the tweets in the thread (independent variable)
sum_favorite_count: the sum of the number of favorites for the tweets in the thread (independent variable)
mean_retweet_count: the mean number of retweets for the tweets in the thread (independent variable)
sum_retweet_count: the sum of the number of retweets for the tweets in the thread (independent variable)
sum_display_text_width: the sum of the number of characters for the tweets in the thread (independent variable)
mean_display_text_width: the mean of the number of characters for the tweets in the thread (independent variable)
code: the qualitative code (TS = transactional; TF = transformational) (dependent variable)
One big type of feature is not included in this analysis: information about the text of the tweets themselves. This is likely to be quite predictive: the words that users included in their tweets are probably associated with (and predictive of) substantive or transactional conversations. Given our focus on ML in this topic area, we do not include features relating to the text data, but we think this could be a great direction for future work (and research) in this area.
d <- read_csv("data/ngsschat-processed-data.csv")
## Rows: 3793 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): code
## dbl (4): mean_favorite_count, mean_retweet_count, sum_display_text_width, n
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d <- d %>%
mutate(code = as.factor(code)) # the outcome must be a factor for classification models
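Though not a step in the original labs, it can be worth a quick look at how balanced the outcome is before modeling; a short optional check (output not shown here):
# optional: how many threads received each code, and the proportions
# name = "n_threads" avoids clashing with the existing n column in the data
d %>%
  count(code, name = "n_threads") %>%
  mutate(prop = n_threads / sum(n_threads))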
We treat this step relatively briefly, as we carried out very similar steps in LL1 and LL2; return to those case studies (especially LL1) for more on data splitting. Note that we carry out the k-folds cross-validation process introduced in LL2. Consider - like there - setting a different value for v (k) if you think it appropriate.
You do have one step that is your turn! Please add the code for setting up the k-folds cross-validation, exactly as you did in LL2.
set.seed(20220712)
train_test_split <- initial_split(d, prop = .80)
data_train <- training(train_test_split)
#<replace this line with your kfcv code!>
kfcv <- vfold_cv(data_train, v = 8)
In Step 1, we noted how we added three variables as potential features. Here, we carry out two feature engineering steps we have carried out before - standardizing the numeric variables (to have a mean equal to 0 and a standard deviation equal to 1) and dropping any features with near-zero variance. Consider adding other feature engineering steps - perhaps the step you carried out to complete the badge requirements for LL2.
Add a feature engineering step below. Consider those described here.
my_rec <- recipe(code ~ ., data = data_train) %>%
step_normalize(all_numeric_predictors()) %>%
step_nzv(all_predictors()) %>%
step_dummy(all_nominal_predictors()) %>% # dummy code all factor variables
step_impute_knn(all_predictors()) # impute missing data for all predictor variables
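The recipe is prepped automatically when the workflow is fit, but if you want to confirm what these steps do to the data, one optional check is to prep and bake it yourself (output not shown here):
# optional: prep the recipe on the training data and inspect the engineered features
my_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%  # new_data = NULL returns the processed training data
  glimpse()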
There are several steps here that differ from the past learning labs. We use:
rand_forest() to set the model type to a random forest
set_engine("ranger", importance = "impurity") to set the engine to the one provided for random forests through the {ranger} package; the importance = "impurity" argument lets us later interpret a variable importance metric (impurity) specific to random forest models
set_mode("classification") as we are again predicting categories (transactional and substantive conversations taking place through #NGSSchat)
# specify model
my_mod <-
rand_forest(mtry = tune(),
min_n = tune(),
trees = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
# specify workflow
my_wf <-
workflow() %>%
add_model(my_mod) %>%
add_recipe(my_rec)
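If you are curious how this parsnip specification maps onto the underlying {ranger} package, translate() shows the call that will eventually be made; the tune() placeholders remain until we choose values for them (output not shown here):
# optional: see how parsnip will call ranger::ranger() under the hood
my_mod %>%
  translate()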
Here, things become different once again. We’ll use a grid method to specify the tuning parameters: the number of predictor variables that are randomly sampled at each split in a tree (mtry), the minimum number of data points in a node required to execute a further split (min_n), and the number of trees in the forest (trees). size refers to how many distinct combinations of the tuning parameters will be returned. 10 is a relatively small number - we can imagine a much larger number of combinations of these hyperparameters - but it should give us a sense of which values lead to the best performance.
These next calls to finalize() are used to get a sense of what values mtry, min_n, and trees can take, based on the dimensions and ranges of the variables in the data.
finalize(mtry(), data_train)
## # Randomly Selected Predictors (quantitative)
## Range: [1, 5]
finalize(min_n(), data_train)
## Minimal Node Size (quantitative)
## Range: [2, 40]
finalize(trees(), data_train)
## # Trees (quantitative)
## Range: [1, 2000]
We then use these values in the grid_max_entropy() function below; in the lab template, replace the xx values with the maximum values reported above for mtry and min_n. You can see that tree_grid is simply a set of combinations of values of the hyperparameters.
tree_grid <- grid_max_entropy(mtry(range = c(1, 5)),
                              min_n(range = c(2, 40)),
                              trees(range = c(1, 600)),
                              size = 10)
tree_grid
## # A tibble: 10 × 3
## mtry min_n trees
## <int> <int> <int>
## 1 3 30 358
## 2 4 32 545
## 3 1 2 567
## 4 5 3 141
## 5 1 34 586
## 6 1 24 86
## 7 5 6 486
## 8 2 7 193
## 9 5 34 239
## 10 4 26 59
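As an aside, a maximum-entropy grid is only one way to choose candidate combinations. A regular grid that crosses a few levels of each tuning parameter is another common option; a minimal sketch using the same ranges (not used in this case study):
# an alternative grid (not used here): cross 3 levels of each tuning parameter,
# giving 3 x 3 x 3 = 27 candidate combinations
regular_grid <- grid_regular(mtry(range = c(1, 5)),
                             min_n(range = c(2, 40)),
                             trees(range = c(1, 600)),
                             levels = 3)
regular_grid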
Now, we’re ready to fit the model. Note - this can take some time, the longest of any function we’ve run so far, as we’re estimating one model for every combination of fold and candidate set of hyperparameters: with v = 8 folds and size = 10 candidate combinations, that is 80 models.
# fit model with tune_grid()
fitted_model <- my_wf %>%
tune_grid(
resamples = kfcv,
grid = tree_grid,
metrics = metric_set(roc_auc, accuracy, kap, sensitivity, specificity, precision)
)
## ! Fold1: preprocessor 1/1, model 4/10: 5 columns were requested but there were 4 predictors in the data. 4 will...
## ! Fold1: preprocessor 1/1, model 7/10: 5 columns were requested but there were 4 predictors in the data. 4 will...
## ! Fold1: preprocessor 1/1, model 9/10: 5 columns were requested but there were 4 predictors in the data. 4 will...
## ! Fold2: preprocessor 1/1, model 4/10: 5 columns were requested but there were 4 predictors in the data. 4 will...
## ! Fold2: preprocessor 1/1, model 7/10: 5 columns were requested but there were 4 predictors in the data. 4 will...
## ! Fold2: preprocessor 1/1, model 9/10: 5 columns were requested but there were 4 predictors in the data. 4 will...
## ! Fold3: preprocessor 1/1, model 4/10: 5 columns were requested but there were 4 predictors in the data. 4 will...
## ! Fold3: preprocessor 1/1, model 7/10: 5 columns were requested but there were 4 predictors in the data. 4 will...
## ! Fold3: preprocessor 1/1, model 9/10: 5 columns were requested but there were 4 predictors in the data. 4 will...
## ! Fold4: preprocessor 1/1, model 4/10: 5 columns were requested but there were 4 predictors in the data. 4 will...
## ! Fold4: preprocessor 1/1, model 7/10: 5 columns were requested but there were 4 predictors in the data. 4 will...
## ! Fold4: preprocessor 1/1, model 9/10: 5 columns were requested but there were 4 predictors in the data. 4 will...
## ! Fold5: preprocessor 1/1, model 4/10: 5 columns were requested but there were 4 predictors in the data. 4 will...
## ! Fold5: preprocessor 1/1, model 7/10: 5 columns were requested but there were 4 predictors in the data. 4 will...
## ! Fold5: preprocessor 1/1, model 9/10: 5 columns were requested but there were 4 predictors in the data. 4 will...
## ! Fold6: preprocessor 1/1, model 4/10: 5 columns were requested but there were 4 predictors in the data. 4 will...
## ! Fold6: preprocessor 1/1, model 7/10: 5 columns were requested but there were 4 predictors in the data. 4 will...
## ! Fold6: preprocessor 1/1, model 9/10: 5 columns were requested but there were 4 predictors in the data. 4 will...
## ! Fold7: preprocessor 1/1, model 4/10: 5 columns were requested but there were 4 predictors in the data. 4 will...
## ! Fold7: preprocessor 1/1, model 7/10: 5 columns were requested but there were 4 predictors in the data. 4 will...
## ! Fold7: preprocessor 1/1, model 9/10: 5 columns were requested but there were 4 predictors in the data. 4 will...
## ! Fold8: preprocessor 1/1, model 4/10: 5 columns were requested but there were 4 predictors in the data. 4 will...
## ! Fold8: preprocessor 1/1, model 7/10: 5 columns were requested but there were 4 predictors in the data. 4 will...
## ! Fold8: preprocessor 1/1, model 9/10: 5 columns were requested but there were 4 predictors in the data. 4 will...
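The warnings above are not a problem: for the candidate models where mtry was 5, only 4 predictors were available, so 4 were used. Before narrowing in on a single best set of tuning parameters, you can also optionally look across all of the candidates; collect_metrics() returns the fold-averaged metrics for every combination, and autoplot() gives a quick visual comparison (output not shown here):
# optional: inspect all of the tuning results before choosing among them
fitted_model %>%
  collect_metrics() # one row per metric per candidate set of tuning parameters

fitted_model %>%
  autoplot() # plot performance across the values of the tuning parameters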
Here come some additional steps. This next step is key both technically and conceptually - we’re examining the best sets of tuning parameters, ranked by their predictive accuracy.
# examine the best sets of tuning parameters, ranked by accuracy
show_best(fitted_model, n = 10, metric = "accuracy")
## # A tibble: 10 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 4 59 26 accuracy binary 0.884 8 0.00402 Preprocessor1_Mode…
## 2 3 358 30 accuracy binary 0.881 8 0.00396 Preprocessor1_Mode…
## 3 2 193 7 accuracy binary 0.881 8 0.00565 Preprocessor1_Mode…
## 4 4 545 32 accuracy binary 0.881 8 0.00344 Preprocessor1_Mode…
## 5 1 567 2 accuracy binary 0.880 8 0.00563 Preprocessor1_Mode…
## 6 5 239 34 accuracy binary 0.879 8 0.00346 Preprocessor1_Mode…
## 7 1 86 24 accuracy binary 0.878 8 0.00525 Preprocessor1_Mode…
## 8 1 586 34 accuracy binary 0.876 8 0.00548 Preprocessor1_Mode…
## 9 5 486 6 accuracy binary 0.865 8 0.00537 Preprocessor1_Mode…
## 10 5 141 3 accuracy binary 0.862 8 0.00556 Preprocessor1_Mode…
This function simply indicates that you want to use the best of the sets of tuning parameters examined through the code in the above chunk - literally the first row.
# select best set of tuning parameters
best_tree <- fitted_model %>%
select_best(metric = "accuracy")
Next, we’ll finalize the workflow with the best set of tuning parameters and then use last_fit() to fit the model on the training data and evaluate it on the testing data.
final_wf <- my_wf %>%
finalize_workflow(best_tree)
final_fit <- final_wf %>%
last_fit(train_test_split, metrics = metric_set(roc_auc, accuracy, kap, sensitivity, specificity, precision))
We can see that final_fit is for a single fit: a random forest with the best-performing tuning parameters, trained with the entire training set of data to predict the values in our (otherwise not used, or “spent”) testing set of data.
final_fit
## # Resampling results
## # Manual resampling
## # A tibble: 1 × 6
## splits id .metrics .notes .predictions .workflow
## <list> <chr> <list> <list> <list> <list>
## 1 <split [3034/759]> train/test split <tibble> <tibble> <tibble> <workflow>
Last, we can interpret the accuracy of our tuned model.
# fit stats
final_fit %>%
collect_metrics()
## # A tibble: 6 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.877 Preprocessor1_Model1
## 2 kap binary 0.752 Preprocessor1_Model1
## 3 sensitivity binary 0.885 Preprocessor1_Model1
## 4 specificity binary 0.871 Preprocessor1_Model1
## 5 precision binary 0.842 Preprocessor1_Model1
## 6 roc_auc binary 0.932 Preprocessor1_Model1
Interpreting these - apart from accuracy - may present some challenges. First, we can focus on the accuracy: around 88% (0.877), slightly higher than the roughly 87% (0.872) accuracy of our initial model in LL1. Accuracy and the other metrics are defined below.
Accuracy: For the known codes, what percentage of the predictions are correct?
Cohen’s Kappa: The same as accuracy, but accounting for the base rate of (chance) agreement
Sensitivity (AKA recall): Among the true “positives”, what percentage are classified as “positive”?
Specificity: Among the true “negatives”, what percentage are classified as “negative”?
Precision: Among the cases classified as “positive”, what percentage are truly “positive”?
ROC AUC: Across different classification thresholds, what is the trade-off between sensitivity and specificity?
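To see where the sensitivity, specificity, and precision values come from, one option is to pull the test-set predictions out of final_fit and cross-tabulate them with the known codes in a confusion matrix (output not shown here):
# optional: the confusion matrix underlying the test-set metrics above
final_fit %>%
  collect_predictions() %>%
  conf_mat(truth = code, estimate = .pred_class)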
You’ll have the chance to interpret these further in the badge for this learning lab.
One last note - we may be interested in seeing which variables were most important. We can examine variable importance with the following.
final_fit %>%
pluck(".workflow", 1) %>%
extract_fit_parsnip() %>%
vip(num_features = 10)
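If you prefer the importance values as a table rather than a plot, the same {vip} package provides vi(), which returns a tibble you can sort or join with other results:
# the same importance information as a tibble rather than a plot
final_fit %>%
  pluck(".workflow", 1) %>%
  extract_fit_parsnip() %>%
  vi()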
Congratulations - you’ve completed this case study! Consider moving on to the badge activity next.