library(tidyverse)
library(tidymodels)
library(themis)
library(stacks)
library(caret)  # loaded only to bring in the GermanCredit data
library(ROCR)
library(randomForest)
library(kknn)
library(kernlab)
data(GermanCredit)

In the previous tutorial series, we walked through the “caret” package in R for machine learning. We used the raw “GermanCredit” dataset, performed a brief exploration of it, and used the package to walk through a variety of steps: pre-processing, removing low information features, tuning hyperparameters, correcting for class imbalance, and summarizing results based on metrics we deem important. Where possible, we will perform the exact same exercise here, except we will use the “tidymodels” suite of packages to do so.

There are a few sources from which this tutorial draws influence and structure.

Background on “tidymodels”

The “tidymodels” suite of packages is similar to the “tidyverse” suite of packages, in the sense that it bundles together many smaller packages into a unified framework. Each package has a different purpose. The ones we will use here are “rsample” (data splitting and resampling), “recipes” (pre-processing), “parsnip” (model specification and fitting), “yardstick” (performance metrics), “tune” (hyperparameter tuning), and “workflows” (bundling recipes and models together), along with the add-on packages “themis” (class imbalance) and “stacks” (model stacking).


General preparation of the dataset for machine learning

We will perform the same pre-processing steps as in the “caret” tutorial - essentially, we want to keep only certain variables, rename one of them, and introduce some missing values first to simulate the messiness of real-world data.

# Select variables
GermanCredit <- GermanCredit %>%
  dplyr::select(Class, Duration, Amount, Age, ResidenceDuration,
                NumberExistingCredits, NumberPeopleMaintenance,
                Telephone, ForeignWorker, Housing.Rent, Housing.Own,
                Housing.ForFree, Property.RealEstate,
                Property.Insurance, Property.CarOther,
                Property.Unknown) %>%
  dplyr::rename("EmploymentDuration" = "Duration")
# Simulate missing data for the variables Age and Employment Duration
n <- nrow(GermanCredit)
agePct <- 3
durationPct <- 7
# Generate rows that will hold missing data
set.seed(713)
ageMissingPctRows <- sample(1:n, round(agePct/100 * n, 0))
set.seed(713)
durationMissingPctRows <- sample(1:n, round(durationPct/100 * n, 0))
# Make values NA's
GermanCredit[ageMissingPctRows, "Age"] <- NA
GermanCredit[durationMissingPctRows, "EmploymentDuration"] <- NA
# Code certain variables as factors
GermanCredit <- GermanCredit %>%
  mutate(across(.cols = c("ResidenceDuration",
                          "NumberExistingCredits",
                          "NumberPeopleMaintenance", "Telephone",
                          "ForeignWorker"), .fns = factor))

Let’s get a look at our dataset now:

summary(GermanCredit)
##   Class     EmploymentDuration     Amount           Age       
##  Bad :300   Min.   : 4.00      Min.   :  250   Min.   :19.00  
##  Good:700   1st Qu.:12.00      1st Qu.: 1366   1st Qu.:27.00  
##             Median :18.00      Median : 2320   Median :33.00  
##             Mean   :20.92      Mean   : 3271   Mean   :35.65  
##             3rd Qu.:24.00      3rd Qu.: 3972   3rd Qu.:42.00  
##             Max.   :72.00      Max.   :18424   Max.   :75.00  
##             NA's   :70                         NA's   :30     
##  ResidenceDuration NumberExistingCredits NumberPeopleMaintenance Telephone
##  1:130             1:633                 1:845                   0:404    
##  2:308             2:333                 2:155                   1:596    
##  3:149             3: 28                                                  
##  4:413             4:  6                                                  
##                                                                           
##                                                                           
##                                                                           
##  ForeignWorker  Housing.Rent    Housing.Own    Housing.ForFree
##  0: 37         Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1:963         1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000  
##                Median :0.000   Median :1.000   Median :0.000  
##                Mean   :0.179   Mean   :0.713   Mean   :0.108  
##                3rd Qu.:0.000   3rd Qu.:1.000   3rd Qu.:0.000  
##                Max.   :1.000   Max.   :1.000   Max.   :1.000  
##                                                               
##  Property.RealEstate Property.Insurance Property.CarOther Property.Unknown
##  Min.   :0.000       Min.   :0.000      Min.   :0.000     Min.   :0.000   
##  1st Qu.:0.000       1st Qu.:0.000      1st Qu.:0.000     1st Qu.:0.000   
##  Median :0.000       Median :0.000      Median :0.000     Median :0.000   
##  Mean   :0.282       Mean   :0.232      Mean   :0.332     Mean   :0.154   
##  3rd Qu.:1.000       3rd Qu.:0.000      3rd Qu.:1.000     3rd Qu.:0.000   
##  Max.   :1.000       Max.   :1.000      Max.   :1.000     Max.   :1.000   
## 

“Class” is our response variable, and it has a 70/30 class balance (700 Good vs. 300 Bad). EmploymentDuration and Age now contain missing values that we will address later; the remaining predictors are categorical, though they are coded in different ways. For example, “Telephone” and “ForeignWorker” are single 0/1 factors, whereas “Housing” has already been split into three indicator columns: “Housing.Rent”, “Housing.Own”, and “Housing.ForFree”. We will address this during pre-processing.

There are some variables with very low variance. In “caret”, the nearZeroVar() function identifies these; the “tidymodels” equivalent is the recipe step step_nzv(), which we will apply at the pre-processing stage.

First, however, we will drop the variable “ForeignWorker” due to its relative lack of variation, and merge levels 2, 3, and 4 of the variable NumberExistingCredits, as was done in the “caret” tutorial.

GermanCredit <- dplyr::select(GermanCredit, -ForeignWorker)
GermanCredit$NumberExistingCredits <- fct_collapse(GermanCredit$NumberExistingCredits, "2+" = c("2", "3", "4"))


rsample

We will use the initial_split(), training(), and testing() functions to partition our data into a training set and a test set. initial_split() returns an object of class “rsplit” that records the partition (with the default prop, a 75/25 split), and training() and testing() then extract the corresponding data sets.

set.seed(713)
trainIndex <- initial_split(GermanCredit)  # prop and strata arguments removed due to errors; without stratification the class balance may differ slightly between sets
trainingSet <- training(trainIndex)
testSet <- testing(trainIndex)
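
Because we dropped the strata argument, it is worth a quick check (a minimal sketch; output not shown) that the class balance ended up similar in the two sets:

# Compare the proportion of Good/Bad in the training and test sets
trainingSet %>% count(Class) %>% mutate(prop = n / sum(n))
testSet %>% count(Class) %>% mutate(prop = n / sum(n))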

Let’s summarize the training set by itself.

summary(trainingSet)
##   Class     EmploymentDuration     Amount           Age       
##  Bad :226   Min.   : 4.00      Min.   :  276   Min.   :19.00  
##  Good:524   1st Qu.:12.00      1st Qu.: 1377   1st Qu.:27.00  
##             Median :18.00      Median : 2309   Median :33.00  
##             Mean   :20.71      Mean   : 3261   Mean   :35.76  
##             3rd Qu.:24.00      3rd Qu.: 3978   3rd Qu.:42.00  
##             Max.   :72.00      Max.   :18424   Max.   :74.00  
##             NA's   :70                         NA's   :30     
##  ResidenceDuration NumberExistingCredits NumberPeopleMaintenance Telephone
##  1:103             1 :461                1:633                   0:313    
##  2:228             2+:289                2:117                   1:437    
##  3:113                                                                    
##  4:306                                                                    
##                                                                           
##                                                                           
##                                                                           
##   Housing.Rent    Housing.Own     Housing.ForFree  Property.RealEstate
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000     
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000     
##  Median :0.000   Median :1.0000   Median :0.0000   Median :0.0000     
##  Mean   :0.176   Mean   :0.7147   Mean   :0.1093   Mean   :0.2853     
##  3rd Qu.:0.000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.0000     
##  Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000     
##                                                                       
##  Property.Insurance Property.CarOther Property.Unknown
##  Min.   :0.000      Min.   :0.0000    Min.   :0.0000  
##  1st Qu.:0.000      1st Qu.:0.0000    1st Qu.:0.0000  
##  Median :0.000      Median :0.0000    Median :0.0000  
##  Mean   :0.228      Mean   :0.3373    Mean   :0.1493  
##  3rd Qu.:0.000      3rd Qu.:1.0000    3rd Qu.:0.0000  
##  Max.   :1.000      Max.   :1.0000    Max.   :1.0000  
## 

This covers the “rsample” package.

recipes

The “recipes” package is our go-to for defining the steps that will be used to pre-process the data. We will see examples here of imputing missing data with the KNN algorithm, creating dummy variables from the factors (each level, minus a reference level, gets its own indicator column), removing near-zero-variance predictors, and rescaling the predictors.

This begins by specifying a formula that defines the outcome variable in terms of the other variables. After that, various “step” functions are used to add the pre-processing transformations. For our purposes here, these are step_impute_knn() (formerly known as step_knnimpute()), step_dummy(), step_nzv(), and step_range(). Note that we are using step_range() to rescale predictors between 0 and 1; for the common practice of standardizing variables, you can use step_normalize() instead.

There are some useful helper functions that can assist here, such as all_predictors(), all_nominal(), and all_outcomes(), which select whole groups of variables so that each step does not need every column named individually.

set.seed(713)
creditRecipe <-
  recipe(Class ~., data = trainingSet) %>%
  step_impute_knn(all_predictors()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_nzv(all_predictors()) %>%
  step_range(all_predictors(), -all_nominal(), min = 0, max = 1)

This returns an object of class “recipe”, which contains the instructions (i.e. a “recipe”) for transforming a dataset. We estimate (train) the recipe on a data set using the prep() function, and then use the bake() function to apply the trained recipe to a set of data.

trainingSet_processed <- creditRecipe %>%
  prep(trainingSet) %>%
  bake(trainingSet)
# Caution: prepping the recipe separately on the test set (below) re-estimates
# the imputation and ranges from test data; reusing the recipe prepped on the
# training set is the safer pattern
testSet_processed <- creditRecipe %>%
  prep(testSet) %>%
  bake(testSet)

We have now used the recipe steps to create fully processed training and test sets. We are ready to train machine learning algorithms.

Recursive feature elimination can be challenging and is not as streamlined in “tidymodels” as it is in “caret”. Here is documentation and code to perform this task if you wish to: https://github.com/stevenpawley/recipeselectors/issues/1


parsnip

The “parsnip” package is used to specify and train the machine learning algorithm in question. We choose a model type (for example, logistic_reg() or rand_forest()), use the set_mode() function to select “regression” or “classification”, and then specify an implementation using the set_engine() function. The full list of available models and engines can be found under “Parsnip” in the RStudio Addins menu.

We’ll try a logistic regression as well as a random forest.

glmModel <-
  logistic_reg(mode = "classification") %>%
  set_engine("glm")
rfModel <-
  rand_forest(mode = "classification", mtry = 3, trees = 500, 
              min_n = 5) %>%
  set_engine("randomForest")

We can fit these models next with the fit() function:

glmFit <- fit(glmModel, Class ~ ., data = trainingSet_processed)
rfFit <- fit(rfModel, Class ~ ., data = trainingSet_processed)

Let’s look at the output:

glmFit
## parsnip model object
## 
## 
## Call:  stats::glm(formula = Class ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
##                (Intercept)          EmploymentDuration  
##                    1.15003                    -2.13074  
##                     Amount                         Age  
##                   -0.61133                     0.52270  
##               Housing.Rent                 Housing.Own  
##                   -0.53973                    -0.03864  
##            Housing.ForFree         Property.RealEstate  
##                         NA                     0.84456  
##         Property.Insurance           Property.CarOther  
##                    0.48787                     0.56296  
##           Property.Unknown        ResidenceDuration_X2  
##                         NA                    -0.19285  
##       ResidenceDuration_X3        ResidenceDuration_X4  
##                   -0.10517                     0.08012  
##  NumberExistingCredits_X2.  NumberPeopleMaintenance_X2  
##                    0.20662                    -0.13077  
##               Telephone_X1  
##                   -0.38173  
## 
## Degrees of Freedom: 749 Total (i.e. Null);  735 Residual
## Null Deviance:       918 
## Residual Deviance: 862.9     AIC: 892.9
rfFit
## parsnip model object
## 
## 
## Call:
##  randomForest(x = maybe_data_frame(x), y = y, ntree = ~500, mtry = min_cols(~3,      x), nodesize = min_rows(~5, x)) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 28.13%
## Confusion matrix:
##      Bad Good class.error
## Bad   37  189  0.83628319
## Good  22  502  0.04198473

This concludes the development of models on the training set. Next, we will assess their performance on the test set.

yardstick

Next, we need to use these fitted model objects to predict classes for the test set. After that, we will calculate and interpret some metrics describing the performance of the classifiers we trained. We start by creating a data frame containing just the true class and the predicted class.

set.seed(713)
classWithPredictions <- testSet_processed %>%
  dplyr::select(Class) %>%
  bind_cols(predict(rfFit, testSet_processed))

We’ll start by using the metric_set() function from the “yardstick” package to define the metrics we are interested in. Much like in the “caret” tutorial, we are interested in sensitivity, specificity, and positive predictive value. We namespace-qualify sens and spec with yardstick:: to prevent collisions with similarly named functions in other attached packages (for example, readr::spec()).

metricSet <- metric_set(accuracy, yardstick::sens, yardstick::spec, ppv)

We can now just call this metric set on the result set we created earlier. Setting event_level = "second" tells yardstick to treat the second factor level (“Good”) as the event of interest.

metricSet(classWithPredictions, truth = Class, estimate = .pred_class, event_level = "second")
## # A tibble: 4 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.724
## 2 sens     binary         0.920
## 3 spec     binary         0.257
## 4 ppv      binary         0.747

As you can see here, the model has reasonable sensitivity and positive predictive value, but abysmal specificity. Such is life for practitioners of machine learning. We saw the same pattern in the “caret” tutorial, driven largely by the imbalanced outcome class.
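
To see where the specificity is lost, a quick look at the confusion matrix helps (a minimal sketch; output not shown):

# Cross-tabulate true vs. predicted classes with yardstick
conf_mat(classWithPredictions, truth = Class, estimate = .pred_class)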


themis

Next, another consideration is the sampling method used to address class imbalance. The two primary techniques for doing this are “down-sampling” and “up-sampling”. There is a package in the “tidymodels” framework known as “themis” for handling this problem, though it must be installed and loaded separately.

The definitions of these two approaches are as follows: down-sampling randomly removes cases from the majority class until the classes are balanced, while up-sampling randomly replicates (samples with replacement) cases from the minority class until the same balance is reached.

We can use the functions step_downsample() or step_upsample() to perform these approaches as recipe steps. Note, however, that they are intended to be applied ONLY to the training set. For that reason, these steps default to skip = TRUE, which means they are skipped when the recipe is baked on new data; the combination of prep() followed by juice() can be used to obtain the down-sampled version of the training data. Let’s look at an example.

downRecipe <- 
  recipe(Class ~ ., data = trainingSet_processed) %>%
  themis::step_downsample(Class, under_ratio = 1, skip = TRUE)
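
Because the down-sampling step is skipped at bake time, we can sanity-check it with prep() and juice(), which return the processed copy of the training data the recipe was prepped on (a minimal sketch; output not shown - the two classes should now have equal counts):

# prep() estimates the recipe; juice() returns the processed training data
downRecipe %>%
  prep() %>%
  juice() %>%
  count(Class)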


workflows

For processing multiple steps, it is often helpful to build a “workflow”, which is essentially a container object that combines the information required to fit and predict from a model. Steps such as recipes and models can be added to these workflows and then changes and iterations can be performed quickly.

Let’s see an example with the random forest. However, this time we are not going to manually specify the parameters - but rather, we will specify that we wish to tune them later.

set.seed(713)
rfModel <- rand_forest(mtry = tune()) %>%
  set_mode("classification") %>%
  set_engine("randomForest")
rfWorkflow <- workflow() %>%
  add_recipe(downRecipe) %>%
  add_model(rfModel)

Let’s create a model fit using the training set:

rfFit <- rfWorkflow %>%
  fit(data = trainingSet_processed)

That’s a lot of steps, and we’ll keep building on this in the next stage. For now, let’s just check how we’re doing:

set.seed(713)
classWithPredictions <- testSet_processed %>%
  dplyr::select(Class) %>%
  bind_cols(predict(rfFit, testSet_processed))
metricSet(classWithPredictions, truth = Class, estimate = .pred_class, event_level = "second")
## # A tibble: 4 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.596
## 2 sens     binary         0.551
## 3 spec     binary         0.703
## 4 ppv      binary         0.815

As is often the case once we correct for class imbalance, we have a more balanced profile between sensitivity and specificity at the expense of overall accuracy.


tune

Next, we need to actually tune the hyperparameters rather than letting the defaults take over. Recall from the “caret” tutorial that we created a tuning grid to manually specify values for a given hyperparameter that we wanted to try. We can do the same thing in the tidymodels framework by using the tune_grid() function.

The only hyperparameter we have flagged for tuning in our random forest specification is “mtry” (rand_forest() also exposes “trees” and “min_n”, which we are leaving at their defaults). You can verify this by using the parameters() function to access parameter information, as shown below. We will then create a grid of different values for the random forest to try.
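
For example, a quick way to list the tunable parameters of our specification (a small sketch; note that in newer versions of the tune/dials packages, parameters() may be superseded by extract_parameter_set_dials()):

# List the parameters flagged with tune() in the random forest specification
parameters(rfModel)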

Somewhat analogous to what we did with the trainControl() function from “caret”, we will specify a repeated cross-validation scheme using the vfold_cv() function from the “rsample” package.

folds <- vfold_cv(trainingSet_processed, v = 5, repeats = 5)

Now we’re ready to tune our random forest classifier. Let’s try values 3 through 15 for the “mtry” parameter, and for our metrics of interest, let’s specify the areas under the Precision-Recall and Receiver Operating Characteristic curves.

rfGrid <- expand.grid(mtry = 3:15)
set.seed(713)
rfResamples <- rfWorkflow %>%
  tune_grid(resamples = folds,
            grid = rfGrid,
            metrics = metric_set(pr_auc, roc_auc)) %>%
  collect_metrics()
rfResamples
## # A tibble: 26 x 7
##     mtry .metric .estimator  mean     n std_err .config              
##    <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
##  1     3 pr_auc  binary     0.429    25 0.0135  Preprocessor1_Model01
##  2     3 roc_auc binary     0.651    25 0.00812 Preprocessor1_Model01
##  3     4 pr_auc  binary     0.429    25 0.0132  Preprocessor1_Model02
##  4     4 roc_auc binary     0.650    25 0.00752 Preprocessor1_Model02
##  5     5 pr_auc  binary     0.432    25 0.0134  Preprocessor1_Model03
##  6     5 roc_auc binary     0.651    25 0.00720 Preprocessor1_Model03
##  7     6 pr_auc  binary     0.429    25 0.0137  Preprocessor1_Model04
##  8     6 roc_auc binary     0.648    25 0.00715 Preprocessor1_Model04
##  9     7 pr_auc  binary     0.432    25 0.0130  Preprocessor1_Model05
## 10     7 roc_auc binary     0.650    25 0.00697 Preprocessor1_Model05
## # ... with 16 more rows

The results are in a tibble format. This makes it incredibly easy to programmatically select the best performing value for “mtry” based on our metric of choice, and then we can select that value when we evaluate performance on the testing set.

rfTuneAUC <- dplyr::filter(rfResamples, .metric == "roc_auc")
mtry_row <- which.max(rfTuneAUC$mean)
mtry <- rfTuneAUC$mtry[mtry_row]  # pull the value itself rather than a one-row tibble
rfForTesting <- rand_forest(mtry = mtry) %>%
  set_mode("classification") %>%
  set_engine("randomForest") %>%
  fit(Class ~ ., data = trainingSet_processed)
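
If you want to sanity-check the tuned model against the metric set we defined earlier, the pattern is identical to what we used before (a quick sketch; output not shown):

# Class predictions from the tuned random forest, scored with our metric set
tunedPredictions <- testSet_processed %>%
  dplyr::select(Class) %>%
  bind_cols(predict(rfForTesting, testSet_processed))
metricSet(tunedPredictions, truth = Class, estimate = .pred_class, event_level = "second")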

Let’s use the “ROCR” package to visualize the ROC curve, as we did to supplement our activities in the “caret” tutorial. In the process, we’ll also illustrate a way to alter the probability threshold for classification!

Return predictions in probability form:

predictionDF <- predict(rfForTesting, testSet_processed, 
                        type = "prob")
predictionDF
## # A tibble: 250 x 2
##    .pred_Bad .pred_Good
##        <dbl>      <dbl>
##  1     0.316      0.684
##  2     0.372      0.628
##  3     0.302      0.698
##  4     0.082      0.918
##  5     0.044      0.956
##  6     0.164      0.836
##  7     0.358      0.642
##  8     0.178      0.822
##  9     0.168      0.832
## 10     0.142      0.858
## # ... with 240 more rows

With predictions in a numeric form, we can rather easily change our classifier threshold manually (e.g. a probability of Bad at 0.333 or higher is classified as Bad).
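
For instance, here is a minimal sketch of applying a 1/3 cutoff by hand and scoring the result with the metric set from earlier (the cutoff and object names are illustrative; output not shown):

# Classify as Bad whenever the predicted probability of Bad is at least 1/3
threshold <- 1/3
thresholdedPredictions <- testSet_processed %>%
  dplyr::select(Class) %>%
  mutate(.pred_class = factor(ifelse(predictionDF$.pred_Bad >= threshold, "Bad", "Good"),
                              levels = levels(Class)))
metricSet(thresholdedPredictions, truth = Class, estimate = .pred_class, event_level = "second")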

As we saw in the “caret” tutorial, we require two arguments: “predictions” and “labels” (the source of truth), where the predictions are in probability form.

predictions <- predictionDF$.pred_Good
labels <- testSet_processed$Class
pred <- prediction(predictions, labels)
pred
## A prediction instance
##   with 250 data points

Next, we will create the curve. We want to visualize the traditional ROC curve, and will do this by specifying the metrics we wish to plot: the true positive rate (sensitivity) and the false positive rate (1 - specificity).

perf <- performance(pred, "tpr", "fpr")  # Define what performance metrics we want to visualize
plot(perf, avg = "threshold", colorize = TRUE)
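
If you also want the area under this curve as a single number, ROCR can compute it from the same prediction object (a small sketch; output not shown):

# Extract the scalar AUC from a ROCR performance object
performance(pred, measure = "auc")@y.values[[1]]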

Nice! The last thing we are going to do is stack the predictions from a variety of different classifiers.


stacks

Let’s start by defining the candidate models for ensembling. We have focused on the random forest so far; however, let’s also try K-Nearest Neighbors (KNN) and Support Vector Machines (SVM). For the hyperparameters here, we are setting them to “tune()” like we did before, indicating that we will specify them later in the process.

rfModel <- rfModel   # Defined earlier
knnModel <-
  nearest_neighbor(mode = "classification", neighbors = tune()) %>%
  set_engine("kknn")
svmModel <-
  svm_rbf(mode = "classification", cost = tune(), 
          rbf_sigma = tune()) %>%
  set_engine("kernlab")

Each of these needs a corresponding workflow. We already defined one for the random forest classifier, so part of our work is already done! We’ll use the same “downRecipe” with the down-sampling approach that we used earlier.

rfWorkflow <- rfWorkflow
knnWorkflow <- workflow() %>%
  add_recipe(downRecipe) %>%
  add_model(knnModel)
svmWorkflow <- workflow() %>%
  add_recipe(downRecipe) %>%
  add_model(svmModel)

Great. Now we’ll resample to tune the hyperparameters. Let’s do this a little differently than before: rather than manually specifying the grid of hyperparameter values, we will just specify how many candidate values to try.

ctrlGrid <- control_stack_grid()
set.seed(713)
rfResamples <- rfWorkflow %>%
  tune_grid(resamples = folds,
            grid = 7,
            metrics = metric_set(roc_auc),
            control = ctrlGrid)
set.seed(713)
knnResamples <- knnWorkflow %>%
  tune_grid(resamples = folds,
            grid = 7,
            metrics = metric_set(roc_auc),
            control = ctrlGrid)
set.seed(713)
svmResamples <- svmWorkflow %>%
  tune_grid(resamples = folds,
            grid = 7,
            metrics = metric_set(roc_auc),
            control = ctrlGrid)

Now that that’s done, we can start digging into the real content of the “stacks” package. We will start by using the stacks() function (analogous to the ggplot() function in ggplot2), and then add the ensemble candidates we have defined thus far through resampling, using the add_candidates() function.

creditStack <- 
  stacks() %>%
  add_candidates(rfResamples) %>%
  add_candidates(knnResamples) %>%
  add_candidates(svmResamples)
creditStack
## # A data stack with 3 model definitions and 20 candidate members:
## #   rfResamples: 7 model configurations
## #   knnResamples: 6 model configurations
## #   svmResamples: 7 model configurations
## # Outcome: Class (factor)

Believe it or not, this is just a tibble - nothing more and nothing less. It is a single object containing the predictions from all the various candidates. Next, we can use a function called blend_predictions(). As the name suggests, it combines the outputs from the stack candidates to come up with a final prediction, weighting the predictions from each candidate using a LASSO model.

creditModelStack <- 
  creditStack %>%
  blend_predictions(penalty = 0.001)
creditModelStack
## # A tibble: 4 x 3
##   member                      type        weight
##   <chr>                       <chr>        <dbl>
## 1 .pred_Good_svmResamples_1_3 svm_rbf      2.86 
## 2 .pred_Good_svmResamples_1_4 svm_rbf      2.78 
## 3 .pred_Good_rfResamples_1_7  rand_forest  2.12 
## 4 .pred_Good_rfResamples_1_5  rand_forest  0.299

We can visualize the weights here:

autoplot(creditModelStack, type = "weights")

Candidates with non-zero stacking coefficients (referred to as “members”) can now have their predictions combined. We will train them on the full training set using fit_members().

creditStackMembers <- 
  creditModelStack %>%
  fit_members()

We’re done, and this object is ready to be used for prediction on the test set!

set.seed(713)
stackPredictions <- testSet_processed %>%
  dplyr::select(Class) %>%
  bind_cols(predict(creditStackMembers, testSet_processed))
metricSet(stackPredictions, truth = Class, estimate = .pred_class, event_level = "second")
## # A tibble: 4 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.708
## 2 sens     binary         0.920
## 3 spec     binary         0.203
## 4 ppv      binary         0.733

A final note: the most complicated approach may or may not be the best one for your particular machine learning problem. It all comes down to the structure of your data and the time you can commit to the problem. A two-hour solution may suffice, or you may need to spend two months on the problem at hand.