library(tidyverse)
library(tidymodels)
library(themis)
library(stacks)
library(caret) # loaded only to bring in the GermanCredit data
library(ROCR)
library(randomForest)
library(kknn)
library(kernlab)
data(GermanCredit)
In the previous tutorial series, we walked through the “caret” package in R for machine learning. We used the raw “GermanCredit” dataset, performed a brief exploration of it, and used the package to walk through a variety of steps: pre-processing, removing low information features, tuning hyperparameters, correcting for class imbalance, and summarizing results based on metrics we deem important. Where possible, we will perform the exact same exercise here, except we will use the “tidymodels” suite of packages to do so.
There are a few sources from which this tutorial draws influence and structure.
The “tidymodels” suite of packages is similar to the “tidyverse” suite of packages, in the sense that it bundles together many smaller packages into a unified framework, each with a different purpose. The main ones we will use here are “rsample” (data splitting and resampling), “recipes” (pre-processing), “parsnip” (model specification and fitting), “workflows” (bundling recipes and models together), “tune” (hyperparameter tuning), and “yardstick” (performance metrics), along with the add-on packages “themis” (class imbalance) and “stacks” (model ensembling).
We will perform the same pre-processing steps as in the “caret” tutorial - essentially, we keep only certain variables, rename one of them, and then introduce some missing data to simulate the messiness of real-world datasets.
# Select variables
GermanCredit <- GermanCredit %>%
dplyr::select(Class, Duration, Amount, Age, ResidenceDuration,
NumberExistingCredits, NumberPeopleMaintenance,
Telephone, ForeignWorker, Housing.Rent, Housing.Own,
Housing.ForFree, Property.RealEstate,
Property.Insurance, Property.CarOther,
Property.Unknown) %>%
dplyr::rename("EmploymentDuration" = "Duration")
# Simulate missing data for the variables Age and Employment Duration
n <- nrow(GermanCredit)
agePct <- 3
durationPct <- 7
# Generate rows that will hold missing data
set.seed(713)
ageMissingPctRows <- sample(1:n, round(agePct/100 * n, 0))
set.seed(713)
durationMissingPctRows <- sample(1:n, round(durationPct/100 * n, 0))
# Make values NA's
GermanCredit[ageMissingPctRows, "Age"] <- NA
GermanCredit[durationMissingPctRows, "EmploymentDuration"] <- NA
# Code certain variables as factors
GermanCredit <- GermanCredit %>%
mutate(across(.cols = c("ResidenceDuration",
"NumberExistingCredits",
"NumberPeopleMaintenance", "Telephone",
"ForeignWorker"), .fns = factor))
Let’s get a look at our dataset now:
summary(GermanCredit)
## Class EmploymentDuration Amount Age
## Bad :300 Min. : 4.00 Min. : 250 Min. :19.00
## Good:700 1st Qu.:12.00 1st Qu.: 1366 1st Qu.:27.00
## Median :18.00 Median : 2320 Median :33.00
## Mean :20.92 Mean : 3271 Mean :35.65
## 3rd Qu.:24.00 3rd Qu.: 3972 3rd Qu.:42.00
## Max. :72.00 Max. :18424 Max. :75.00
## NA's :70 NA's :30
## ResidenceDuration NumberExistingCredits NumberPeopleMaintenance Telephone
## 1:130 1:633 1:845 0:404
## 2:308 2:333 2:155 1:596
## 3:149 3: 28
## 4:413 4: 6
##
##
##
## ForeignWorker Housing.Rent Housing.Own Housing.ForFree
## 0: 37 Min. :0.000 Min. :0.000 Min. :0.000
## 1:963 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :1.000 Median :0.000
## Mean :0.179 Mean :0.713 Mean :0.108
## 3rd Qu.:0.000 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000 Max. :1.000
##
## Property.RealEstate Property.Insurance Property.CarOther Property.Unknown
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :0.000 Median :0.000
## Mean :0.282 Mean :0.232 Mean :0.332 Mean :0.154
## 3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000 Max. :1.000 Max. :1.000
##
“Class” is our response variable, and it has a 70/30 class balance. EmploymentDuration and Age now contain missing values that we will address later; the remaining predictor variables are categorical. Notice that they are coded in different ways: “Telephone” and “ForeignWorker” are 0/1 variables, while “Housing” is split into three indicator columns (“Housing.Rent”, “Housing.Own”, and “Housing.ForFree”). We will deal with this during pre-processing.
There are some variables with very low variance. With the “caret” package, we can use the nearZeroVar() function to identify these. The equivalent step in “tidymodels” is step_nzv(), which will be applied at the pre-processing stage.
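Since “caret” is already loaded to supply the data, we can run its check directly as a point of reference (a supplementary illustration, not part of the original pipeline):
nzvMetrics <- caret::nearZeroVar(GermanCredit, saveMetrics = TRUE) # frequency ratio and percent-unique for every column
nzvMetrics[nzvMetrics$nzv, ] # rows flagged as near-zero variance; ForeignWorker shows up here given its 963/37 split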
First, however, we will drop the variable “ForeignWorker” due to its lack of variation, and merge levels 2, 3, and 4 of the variable NumberExistingCredits, as was done in the “caret” tutorial.
GermanCredit <- dplyr::select(GermanCredit, -ForeignWorker)
GermanCredit$NumberExistingCredits <- fct_collapse(GermanCredit$NumberExistingCredits, "2+" = c("2", "3", "4"))
We will use the initial_split(), training(), and testing() functions to partition our data into a training set and a test set. initial_split() returns an object of class “rsplit” that encodes the partition (we use the default 75/25 split here; see the code comment below), and training() and testing() then extract the corresponding data sets.
set.seed(713)
trainIndex <- initial_split(GermanCredit) # prop and strata arguments removed due to errors; this uses the default 75/25 split and may affect the class balance in each set
trainingSet <- training(trainIndex)
testSet <- testing(trainIndex)
Let’s summarize the training set by itself.
summary(trainingSet)
## Class EmploymentDuration Amount Age
## Bad :226 Min. : 4.00 Min. : 276 Min. :19.00
## Good:524 1st Qu.:12.00 1st Qu.: 1377 1st Qu.:27.00
## Median :18.00 Median : 2309 Median :33.00
## Mean :20.71 Mean : 3261 Mean :35.76
## 3rd Qu.:24.00 3rd Qu.: 3978 3rd Qu.:42.00
## Max. :72.00 Max. :18424 Max. :74.00
## NA's :70 NA's :30
## ResidenceDuration NumberExistingCredits NumberPeopleMaintenance Telephone
## 1:103 1 :461 1:633 0:313
## 2:228 2+:289 2:117 1:437
## 3:113
## 4:306
##
##
##
## Housing.Rent Housing.Own Housing.ForFree Property.RealEstate
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.000 Median :1.0000 Median :0.0000 Median :0.0000
## Mean :0.176 Mean :0.7147 Mean :0.1093 Mean :0.2853
## 3rd Qu.:0.000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## Property.Insurance Property.CarOther Property.Unknown
## Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.000 Median :0.0000 Median :0.0000
## Mean :0.228 Mean :0.3373 Mean :0.1493
## 3rd Qu.:0.000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.000 Max. :1.0000 Max. :1.0000
##
This covers the “rsample” package.
The “recipes” package is our go-to for defining the steps that will be used to pre-process the data. We will see examples here for imputing missing data using the KNN algorithm, creating dummy variables from the factors (each level after the first gets its own 0/1 column), removing near-zero-variance predictors, and rescaling the predictors.
This begins by specifying a formula, defining the outcome variable in terms of the other variables. After that, various “step” functions are used to apply the pre-processing transformations. For our purposes here, these are step_impute_knn(), step_dummy(), step_nzv(), and step_range(). Note that we are using step_range() to rescale predictors to between 0 and 1; for the more common practice of standardizing variables, you can use step_normalize(). There are also some useful helper functions, such as all_outcomes(), all_predictors(), and all_nominal(), which select groups of columns for a step rather than naming each one.
set.seed(713)
creditRecipe <-
recipe(Class ~., data = trainingSet) %>%
step_impute_knn(all_predictors()) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_nzv(all_predictors()) %>%
step_range(all_predictors(), -all_nominal(), min = 0, max = 1)
This returns an object of class “recipe”, which contains instructions (i.e. a “recipe”) for transforming a dataset. We can estimate (train) the recipe on the training data with the prep() function, and then use the bake() function to transform a set of data based on that trained recipe.
trainingSet_processed <- creditRecipe %>%
prep(trainingSet) %>%
bake(trainingSet)
testSet_processed <- creditRecipe %>%
prep(trainingSet) %>% # prep on the training set so the test set receives the same learned transformations
bake(testSet)
We have now used the recipe steps to create fully processed training and test sets. We are ready to train machine learning algorithms.
Recursive feature elimination can be challenging and is not as streamlined in “tidymodels” as it is in “caret”. Here is documentation and code to perform this task if you wish to: https://github.com/stevenpawley/recipeselectors/issues/1
The “parsnip” package is used to specify and train the machine learning algorithm in question. We specify a model function, use set_mode() to select “regression” or “classification”, and then specify an implementation with set_engine(). The full list of models and engines can be found under “parsnip” in the RStudio Addins menu.
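Recent versions of “parsnip” also expose this from the console (a quick supplementary check):
show_engines("rand_forest") # engines and modes available for the rand_forest() model
show_engines("logistic_reg")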
We’ll try a logistic regression as well as a random forest.
glmModel <-
logistic_reg(mode = "classification") %>%
set_engine("glm")
rfModel <-
rand_forest(mode = "classification", mtry = 3, trees = 500,
min_n = 5) %>%
set_engine("randomForest")
We can fit these models next with the fit() function:
glmFit <- fit(glmModel, Class ~ ., data = trainingSet_processed)
rfFit <- fit(rfModel, Class ~ ., data = trainingSet_processed)
Let’s look at the output:
glmFit
## parsnip model object
##
##
## Call: stats::glm(formula = Class ~ ., family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) EmploymentDuration
## 1.15003 -2.13074
## Amount Age
## -0.61133 0.52270
## Housing.Rent Housing.Own
## -0.53973 -0.03864
## Housing.ForFree Property.RealEstate
## NA 0.84456
## Property.Insurance Property.CarOther
## 0.48787 0.56296
## Property.Unknown ResidenceDuration_X2
## NA -0.19285
## ResidenceDuration_X3 ResidenceDuration_X4
## -0.10517 0.08012
## NumberExistingCredits_X2. NumberPeopleMaintenance_X2
## 0.20662 -0.13077
## Telephone_X1
## -0.38173
##
## Degrees of Freedom: 749 Total (i.e. Null); 735 Residual
## Null Deviance: 918
## Residual Deviance: 862.9 AIC: 892.9
rfFit
## parsnip model object
##
##
## Call:
## randomForest(x = maybe_data_frame(x), y = y, ntree = ~500, mtry = min_cols(~3, x), nodesize = min_rows(~5, x))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 28.13%
## Confusion matrix:
## Bad Good class.error
## Bad 37 189 0.83628319
## Good 22 502 0.04198473
This concludes training our models on the training set. Next, we will assess their performance.
Next, we need to use these model fit objects to predict classes for the test set. After that, we will calculate and interpret some metrics describing the performance of the classifiers we trained. We need to start by creating a data frame including just the class outcome and the prediction.
set.seed(713)
classWithPredictions <- testSet_processed %>%
dplyr::select(Class) %>%
bind_cols(predict(rfFit, testSet_processed))
We’ll start by using the metric_set() function from the “yardstick” package to define the metrics we are interested in. Much like in the “caret” tutorial, we are interested in sensitivity, specificity, and positive predictive value. We prefix sens and spec with yardstick:: to prevent collisions with functions of the same name from other loaded packages (e.g. readr::spec).
metricSet <- metric_set(accuracy, yardstick::sens, yardstick::spec, ppv)
We can now just call this metric set on the result set we created earlier.
metricSet(classWithPredictions, truth = Class, estimate = .pred_class, event_level = "second")
## # A tibble: 4 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.724
## 2 sens binary 0.920
## 3 spec binary 0.257
## 4 ppv binary 0.747
As you can see, the model has reasonable sensitivity and positive predictive value, but abysmal specificity. Such is life for practitioners of machine learning. We saw the same pattern in the “caret” tutorial, where we attributed it to the imbalanced outcome classes.
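For a fuller picture of where those numbers come from, we can also print the confusion matrix itself (a small supplementary check, not part of the original walkthrough):
conf_mat(classWithPredictions, truth = Class, estimate = .pred_class) # cross-tabulation of predicted vs. true classes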
Another consideration, then, is the sampling method used to correct for class imbalance. The two primary techniques are “down-sampling” and “up-sampling”. The “tidymodels” framework handles this problem through the ‘themis’ package, which must be installed and loaded separately.
The definitions for these two approaches are as follows:
down-sampling: In this approach, we purposely under-sample the majority class. In the example here where 70% of the rows are of credit risk “Good” and 30% are of credit risk “Bad”, we sample from the training set such that the two classes are of the same frequency (in effect, we would use only 60% of the training set).
up-sampling: In this approach, we would over-sample the minority class such that we have an equal number of rows from the two classes.
We can use the functions step_downsample() or step_upsample() to perform these approaches as recipe steps. Note, however, that these approaches are intended to be performed ONLY on the training set. As a result, these functions have an argument skip = TRUE, which specifies that the step is skipped when the recipe is baked; however, the combination of prep() followed by juice() can be used to obtain the down-sampled version of the data. Let’s look at an example.
downRecipe <-
recipe(Class ~ ., data = trainingSet_processed) %>%
themis::step_downsample(Class, under_ratio = 1, skip = TRUE)
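Because skip = TRUE only affects bake(), we can confirm the recipe behaves as intended by prepping it and juicing the result (a quick supplementary check); the analogous up-sampling recipe would simply swap in themis::step_upsample(Class, over_ratio = 1, skip = TRUE).
downRecipe %>%
prep(trainingSet_processed) %>% # estimate the recipe on the processed training data
juice() %>% # return the processed (down-sampled) training data
count(Class) # both classes should now have the same number of rows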
For processing multiple steps, it is often helpful to build a “workflow”, which is essentially a container object that combines the information required to fit and predict from a model. Steps such as recipes and models can be added to these workflows and then changes and iterations can be performed quickly.
Let’s see an example with the random forest. However, this time we are not going to manually specify the parameters - but rather, we will specify that we wish to tune them later.
set.seed(713)
rfModel <- rand_forest(mtry = tune()) %>%
set_mode("classification") %>%
set_engine("randomForest")
rfWorkflow <- workflow() %>%
add_recipe(downRecipe) %>%
add_model(rfModel)
Let’s create a model fit using the training set:
rfFit <- rfWorkflow %>%
fit(data = trainingSet_processed)
This is a lot of steps, and we’ll keep building on them in the next stage. For now, let’s just check how we’re doing…
set.seed(713)
classWithPredictions <- testSet_processed %>%
dplyr::select(Class) %>%
bind_cols(predict(rfFit, testSet_processed))
metricSet(classWithPredictions, truth = Class, estimate = .pred_class, event_level = "second")
## # A tibble: 4 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.596
## 2 sens binary 0.551
## 3 spec binary 0.703
## 4 ppv binary 0.815
As is often the case once we correct for class imbalance, we have a more balanced profile between sensitivity and specificity at the expense of overall accuracy.
But next, we need to actually tune the hyperparameters rather than letting all of this happen automatically. Recall from the “caret” tutorial that we created a tuning grid to manually specify the values we wanted to try for a given hyperparameter. We can do the same thing in the tidymodels framework using the tune_grid() function. The only hyperparameter we have flagged for tuning in the random forest is “mtry”; you can inspect a model’s tuning parameters with the parameters() function. We will create a grid of different mtry values for the random forest to try.
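For instance, a quick way to see which parameters are marked for tuning (note that in newer versions of the tune and dials packages, parameters() has been superseded by extract_parameter_set_dials()):
parameters(rfModel) # lists mtry as the single parameter flagged with tune()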
Somewhat analogous to what we did with the trainControl() function from “caret”, we will specify a repeated cross-validation scheme using the vfold_cv() function from the “rsample” package.
folds <- vfold_cv(trainingSet_processed, v = 5, repeats = 5)
Now we’re ready to tune our random forest classifier. Let’s try values 3 through 15 for the “mtry” parameter, and for our metrics of interest, let’s specify the areas under the Precision-Recall and Receiver Operating Characteristic curves.
rfGrid <- expand.grid(mtry = 3:15)
set.seed(713)
rfResamples <- rfWorkflow %>%
tune_grid(resamples = folds,
grid = rfGrid,
metrics = metric_set(pr_auc, roc_auc)) %>%
collect_metrics()
rfResamples
## # A tibble: 26 x 7
## mtry .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 3 pr_auc binary 0.429 25 0.0135 Preprocessor1_Model01
## 2 3 roc_auc binary 0.651 25 0.00812 Preprocessor1_Model01
## 3 4 pr_auc binary 0.429 25 0.0132 Preprocessor1_Model02
## 4 4 roc_auc binary 0.650 25 0.00752 Preprocessor1_Model02
## 5 5 pr_auc binary 0.432 25 0.0134 Preprocessor1_Model03
## 6 5 roc_auc binary 0.651 25 0.00720 Preprocessor1_Model03
## 7 6 pr_auc binary 0.429 25 0.0137 Preprocessor1_Model04
## 8 6 roc_auc binary 0.648 25 0.00715 Preprocessor1_Model04
## 9 7 pr_auc binary 0.432 25 0.0130 Preprocessor1_Model05
## 10 7 roc_auc binary 0.650 25 0.00697 Preprocessor1_Model05
## # ... with 16 more rows
The results are in a tibble format. This makes it incredibly easy to programmatically select the best performing value for “mtry” based on our metric of choice, and then we can select that value when we evaluate performance on the testing set.
rfTuneAUC <- dplyr::filter(rfResamples, .metric == "roc_auc")
mtry_row <- which.max(rfTuneAUC$mean)
mtry <- rfTuneAUC[mtry_row, "mtry"]
rfForTesting <- rand_forest(mtry = mtry) %>%
set_mode("classification") %>%
set_engine("randomForest") %>%
fit(Class ~ ., data = trainingSet_processed)
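For what it’s worth, the “tune” package can do this selection for us if we hold on to the raw tuning result instead of piping straight into collect_metrics(). A sketch (note that it re-runs the resampling, so it is not free):
set.seed(713)
rfTuneRaw <- rfWorkflow %>%
tune_grid(resamples = folds,
grid = rfGrid,
metrics = metric_set(pr_auc, roc_auc))
select_best(rfTuneRaw, metric = "roc_auc") # one-row tibble with the winning mtry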
Let’s use the “ROCR” package to visualize an AUROC curve, as we did to supplement our activities in the “caret” tutorial. In the process, we’ll also illustrate a way we can alter the probability threshold for classification!
Return predictions in probability form:
predictionDF <- predict(rfForTesting, testSet_processed,
type = "prob")
predictionDF
## # A tibble: 250 x 2
## .pred_Bad .pred_Good
## <dbl> <dbl>
## 1 0.316 0.684
## 2 0.372 0.628
## 3 0.302 0.698
## 4 0.082 0.918
## 5 0.044 0.956
## 6 0.164 0.836
## 7 0.358 0.642
## 8 0.178 0.822
## 9 0.168 0.832
## 10 0.142 0.858
## # ... with 240 more rows
With predictions in a numeric form, we can rather easily change our classifier threshold manually (e.g. a probability of Bad at 0.333 or higher is classified as Bad).
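As a minimal sketch (the 0.333 cutoff is just an illustrative choice, not a tuned value), manual thresholding looks like this:
threshold <- 0.333
manualClass <- factor(ifelse(predictionDF$.pred_Bad >= threshold, "Bad", "Good"),
levels = levels(testSet_processed$Class)) # keep the same factor levels as the truth column
table(manualClass)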
As we saw in the “caret” tutorial, we require two arguments: “predictions” and “labels” (the source of truth), where the predictions are in probability form.
predictions <- predictionDF$.pred_Good
labels <- testSet_processed$Class
pred <- prediction(predictions, labels)
pred
## A prediction instance
## with 250 data points
Next, we will create the curve. We want to visualize the traditional AUROC curve and will do this by specifying the metrics we wish to plot. These will be the true positive rate (sensitivity) and the false positive rate (1 - specificity).
perf <- performance(pred, "tpr", "fpr") # Define what performance metrics we want to visualize
plot(perf, avg = "threshold", colorize = TRUE)
Nice! The last thing we are going to do is stack the predictions from a variety of different classifiers.
Let’s start by defining the candidate models for ensembling. We have focused on the random forest so far; however, let’s also try K-Nearest Neighbors (KNN) and Support Vector Machines (SVM). For the hyperparameters here, we are setting them to “tune()” like we did before, indicating that we will specify them later in the process.
rfModel <- rfModel # Defined earlier
knnModel <-
nearest_neighbor(mode = "classification", neighbors = tune()) %>%
set_engine("kknn")
svmModel <-
svm_rbf(mode = "classification", cost = tune(),
rbf_sigma = tune()) %>%
set_engine("kernlab")
Each of these needs a corresponding workflow. We already defined one for the random forest classifier, so part of our work is already done! We’ll use the same “downRecipe” with the down-sampling approach that we used earlier.
rfWorkflow <- rfWorkflow
knnWorkflow <- workflow() %>%
add_recipe(downRecipe) %>%
add_model(knnModel)
svmWorkflow <- workflow() %>%
add_recipe(downRecipe) %>%
add_model(svmModel)
Great. Now we’ll resample to tune the hyperparameters. Let’s do this a little differently than before: rather than manually specifying the grid of hyperparameter values to optimize over, let’s just specify how many candidate values to try.
ctrlGrid <- control_stack_grid()
set.seed(713)
rfResamples <- rfWorkflow %>%
tune_grid(resamples = folds,
grid = 7,
metrics = metric_set(roc_auc),
control = ctrlGrid)
set.seed(713)
knnResamples <- knnWorkflow %>%
tune_grid(resamples = folds,
grid = 7,
metrics = metric_set(roc_auc),
control = ctrlGrid)
set.seed(713)
svmResamples <- svmWorkflow %>%
tune_grid(resamples = folds,
grid = 7,
metrics = metric_set(roc_auc),
control = ctrlGrid)
Now that that’s done, we can start digging into the real content from the “stacks” package. We begin with the stacks() function (analogous to the ggplot() function in ggplot2), and then add the ensemble candidates we have defined thus far through resampling, using the add_candidates() function.
creditStack <-
stacks() %>%
add_candidates(rfResamples) %>%
add_candidates(knnResamples) %>%
add_candidates(svmResamples)
creditStack
## # A data stack with 3 model definitions and 20 candidate members:
## # rfResamples: 7 model configurations
## # knnResamples: 6 model configurations
## # svmResamples: 7 model configurations
## # Outcome: Class (factor)
Believe it or not, this is just a tibble - nothing more and nothing less. It is one object that contains predictions from all the various candidates. If that weren’t useful enough, we can next use a function called blend_predictions(). As the name suggests, it combines the outputs from the stack candidates to come up with a final prediction. This is done by weighting the predictions from each candidate using a LASSO model.
creditModelStack <-
creditStack %>%
blend_predictions(penalty = 0.001)
creditModelStack
## # A tibble: 4 x 3
## member type weight
## <chr> <chr> <dbl>
## 1 .pred_Good_svmResamples_1_3 svm_rbf 2.86
## 2 .pred_Good_svmResamples_1_4 svm_rbf 2.78
## 3 .pred_Good_rfResamples_1_7 rand_forest 2.12
## 4 .pred_Good_rfResamples_1_5 rand_forest 0.299
We can visualize the weights here:
autoplot(creditModelStack, type = "weights")
Candidates with non-zero stacking coefficients (referred to as “members”) can now have their predictions combined. We train them on the full training set using fit_members().
creditStackMembers <-
creditModelStack %>%
fit_members()
We’re done, and this object is ready to be used for prediction on the testing set!
set.seed(713)
stackPredictions <- testSet_processed %>%
dplyr::select(Class) %>%
bind_cols(predict(creditStackMembers, testSet_processed))
metricSet(stackPredictions, truth = Class, estimate = .pred_class, event_level = "second")
## # A tibble: 4 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.708
## 2 sens binary 0.920
## 3 spec binary 0.203
## 4 ppv binary 0.733
A final note: the most complicated approach may or may not be the best one for your particular machine learning problem. It all comes down to the structure of your data and the time you can commit to the problem. A two-hour solution may suffice, or you may need to spend two months on the problem at hand.