Chapter 8 - A model workflow

A model workflow

In the previous two chapters, we discuessed the recipes and parscip packages. These packages can be used to prepare the data for analysis and fitting the model. This chapter introduces a new object called a model workflow. The purpose of this object is to encapsulate the major pieces of the modeling process. The workflow is important in two ways. First, using a workflow object encourages good methodology since it is a single point of entry to the estimation components of a data analysis. Second, it enables the user to better organzie their projects. These two points are discussed in the following sections.

Where does the model begin and end?

So far, when we have used the term ‘the mode’, we have meant a structural equation that relates some predictors to one or more outcomes. Let’s consider again linear regression as an example. The outcome data are denoted as y(i), where there are i=1… n sampls in the training set. Suppose that there are p predictors x(ij) hat are used in the model. Linear regression produces a model equation of

y^(i) = b(0) + b(1)*x(i1) + b(2)*x(i2) +...+ b(p)*x(ip)

The conventional way of thinking about the modeling process is that it only includes the mode fit.

For some data sets that are straightforward in nature, fitting the model itself may be the entire process. However, there are a variety of choices and additional steps that often occur before the model is fit:

While our example model has p predictors, it is common to start with more than p candidate predictors. Through exploratory data analysis. In other cases, a feature selection algorithm maybe used to make a data-driven choice for the minimum predictor set for the model.
There are times when the value of an important predictor is missing. Rather than eliminating this sample from the data set, the missing value could be imputed using other values in the data. For example, if x1 were missing but was correlated with predictors x2 and x3, an imputation method could estimate the missing x1 observation from the values of x2 and x3.
It may be beneficial to transform the scale of a predictor. If there is not a prior information on what the new scale should be, we can estimate the proper scale using a statistical transformation techique, the existing data, and some optimization criterion. Other transformation, such as PCA, take groups of predictors and transform them into newe features that are used as the predictors

While the examples above are related to steps that occur before the model fit, there may also be operations that occur after the model is created. When a classification model is created where the outcome is binary (e.g.,` event and non-event)

It is customary to use a 50% probability cutoff to create a discrete class prediction, also known as a “hard prediction”. For example, a classification model might estimate that the probability of an event was 62%.Using the typical default, the hard prediction would be event. However, the model may need to be more focused on reducing false positive results (i.e., where true non-events are classified as events). One way to do this is to raise the cutoff from 50% to some greater value. This increases the level of evidence required to call a new sample an event. While this reduces the true positive rate (which is bad), it may have a more dramatic effect on reducing false positives. The choice of the cutoff value should be optimized using data. This is an example of a post-processing step that has a significant effect on how well the model works, even though it is not contained in the model fitting step.

It is important to focus on the broader modeling process, instead of only fitting the specific model used to estimate parameters. This broader process includes any preprocessing steps, the model fit itself, as well as potential post-processing activities. In this book, wew will refer to this broader process as the model workflow and include in it any data-driven activities that are used to produce a final model equation.

In other software, such as Python or Spark, similar collections of steps are called pipelines. In tidymodels, the term “pipeline” already connotes a sequence of operations chained together with a pipe operator (such as %>%). Rather than using ambiguous terminology in this context, we call the sequence of computational operations related to modeling workflows.

To illustrate, consider PCA signal extraction. This was previously mentioned in Section 6.6 as a way to replace correlated predictors with new artificial features that are uncorrelated and capture most of the information in the original set. The new features would be used as the predictors and least squares regression could be used to estimate the model parameters.

The fallacy here is that, although PCA does significant computations to produce the components, its operations are assumed to have no uncertainty associated with them. The PCA components are treated as known and, if not included in the model workflow, the effect of PCA could not be adequately measured.

An appropriate approach would be:

approach

In this way, the PCA preprocessing is considered part of the modeling process:

Workflow basics

The workflow package allows the user to bind modeling and preprocessing objects together. Let’s start again with the Ames data and a simple linear model:

library(tidymodels)

## -- Attaching packages --------

## v broom     0.7.0      v recipes   0.1.13
## v dials     0.0.8      v rsample   0.0.7 
## v dplyr     1.0.0      v tibble    3.0.3 
## v ggplot2   3.3.2      v tidyr     1.1.0 
## v infer     0.5.3      v tune      0.1.1 
## v modeldata 0.0.2      v workflows 0.1.2 
## v parsnip   0.1.2      v yardstick 0.0.7 
## v purrr     0.3.4

## -- Conflicts -----------------
## x purrr::discard() masks scales::discard()
## x dplyr::filter()  masks stats::filter()
## x dplyr::lag()     masks stats::lag()
## x recipes::step()  masks stats::step()

setwd('C:/Users/DellPC/Desktop/Corner/R_source_code/Julia_Silge/tidy_model_R_book')

ames <- read.csv('ames.csv')

ames_split <- initial_split(ames, prob= 0.80)
ames_split

## <Analysis/Assess/Total>
## <2198/732/2930>

ames_train <- training(ames_split)
ames_test <- testing(ames_split)

ames_rec <- 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + 
           Latitude + Longitude, data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_other(Neighborhood, threshold = 0.01) %>% 
  step_dummy(all_nominal()) %>% 
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>% 
  step_ns(Latitude, Longitude, deg_free = 20)

library(tidymodels)

lm_model <- linear_reg() %>%
               set_engine('lm')

A workflow always requires a parsnip model object:

lm_wflow <- 
               workflow() %>%
               add_model(lm_model)

lm_wflow

## == Workflow ==================
## Preprocessor: None
## Model: linear_reg()
## 
## -- Model ---------------------
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

Notice that we have not yet specified how this workflow should preprocess the data: Preprocessor: None

If our model were very simple, a standard R formula can be used as a preprocessor:

lm_wflow <-
               lm_wflow %>%
               add_formula(Sale_Price ~ Longitude + Latitude)

Workflows have a fit() method that can be used to create the mode. Using the objects create in Section 7.5:

lm_fit <- fit(lm_wflow, ames_train)

lm_fit

## == Workflow [trained] ========
## Preprocessor: Formula
## Model: linear_reg()
## 
## -- Preprocessor --------------
## Sale_Price ~ Longitude + Latitude
## 
## -- Model ---------------------
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
## (Intercept)    Longitude     Latitude  
##    -298.565       -1.996        2.781

We can also predict() on the fitted workflow:

predict(lm_fit, ames_test %>% slice(1:3))

## # A tibble: 3 x 1
##   .pred
##   <dbl>
## 1  5.27
## 2  5.28
## 3  5.26

The predict() method follows all of the same rules and naming conventions that we described for the parsnip package.

Both the model and preprocessor can be removed or updated:

lm_fit %>% update_formula(Sale_Price ~ Longitude)

## == Workflow ==================
## Preprocessor: Formula
## Model: linear_reg()
## 
## -- Preprocessor --------------
## Sale_Price ~ Longitude
## 
## -- Model ---------------------
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

Note that, in this new object the output shows the previous fitted model was removed since the new formula is inconsistent with the previous model fit.

Workflows and recipes

Instead of using model formulas, recipe objects can also be used to preprocess data for modeling.

#lm_wflow %>%add_recipe(ames_rec)

#Error: A recipe cannot be added when a formula already exists.

That did not work! we can only have one preprocessing method at a time, so we need to remove the formula before adding the recipe.

lm_wflow <- lm_wflow %>%
               remove_formula() %>%
               add_recipe(ames_rec)

lm_wflow

## == Workflow ==================
## Preprocessor: Recipe
## Model: linear_reg()
## 
## -- Preprocessor --------------
## 5 Recipe Steps
## 
## * step_log()
## * step_other()
## * step_dummy()
## * step_interact()
## * step_ns()
## 
## -- Model ---------------------
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

We described the prep(), bake() and juice() functions for using the recipe with a modeling function. This can be onerous, so the fit() method for workflow objects automates this process:

lm_fit <- fit(lm_wflow, ames_train)

predict(lm_fit, ames_test %>% slice(1:3))

## # A tibble: 3 x 1
##   .pred
##   <dbl>
## 1  5.44
## 2  5.24
## 3  5.52

If we need the bare model object or recip, there are pull_* functions that can retrieve them:

# Get the recipe and run tidy() method:

lm_fit %>% pull_workflow_prepped_recipe() %>%
               tidy()

## # A tibble: 5 x 6
##   number operation type     trained skip  id            
##    <int> <chr>     <chr>    <lgl>   <lgl> <chr>         
## 1      1 step      log      TRUE    FALSE log_2NUEF     
## 2      2 step      other    TRUE    FALSE other_PAnzl   
## 3      3 step      dummy    TRUE    FALSE dummy_OiZwH   
## 4      4 step      interact TRUE    FALSE interact_iPbbC
## 5      5 step      ns       TRUE    FALSE ns_lkugu

# Get the recipe and run tidy() method:

lm_fit %>% pull_workflow_fit() %>%
               tidy() %>% slice(1:5)

## # A tibble: 5 x 5
##   term                       estimate std.error statistic  p.value
##   <chr>                         <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)                 0.247    0.391        0.631 5.28e- 1
## 2 Gr_Liv_Area                 0.377    0.0714       5.29  1.38e- 7
## 3 Year_Built                  0.00191  0.000143    13.3   7.33e-39
## 4 Neighborhood_Clear_Creek   -0.0790   0.0284      -2.78  5.50e- 3
## 5 Neighborhood_College_Creek -0.0527   0.0345      -1.53  1.27e- 1

How does a workflow use the formula?

Chapter Summary

library(tidymodels)
data(ames)

ames <- mutate(ames, Sale_Price = log10(Sale_Price))

set.seed(123)
ames_split <- initial_split(ames, prob = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <-  testing(ames_split)

ames_rec <- 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + 
           Latitude + Longitude, data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_other(Neighborhood, threshold = 0.01) %>% 
  step_dummy(all_nominal()) %>% 
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>% 
  step_ns(Latitude, Longitude, deg_free = 20)

lm_model <- linear_reg() %>% set_engine("lm")

lm_wflow <- 
  workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(ames_rec)

lm_fit <- fit(lm_wflow, ames_train)

Once we have a model, we need to know how well it works. A quantitative approach for estimating effectiveness allow us to understand the model, to compare different models, or to tweak the model to improve performance. Our focus in tidymodels is on empirical validation, this usually means using data that were not used to create the model as the substrate to measure effectiveness.

The best approach to empirical validation involves using rsampling methods that will be introduced in Chapter 10. In this chapter, we will use the test set for illustration purposes and to motivate the need for empirical validation. Keep in mind that the test set can only be used once

The choice of which metrics to examine can be critical. In later chapters, certain model parameters will be empirically optimized and a primary performance metric will be used to choose the best sub-model. Choosing the wrong method can easily result in unintended consequences. For example, two common metrics for regression models are the root mean squared error (RMSE) and the coefficient of determination (R^2).

The former measures accuracy while the latter measures correlation

approach

A model optimized for RMSE has more variability but has relatively uniform accuracy across the range of the outcome. The right panel shows that there is a tighter correlation between the observed and predicted values but this model performs poorly in the tails.

This chapter will largely focus on the yardstick package. Before illustrating syntax, let’s explore whether empirical validation using performance metrics is worthwhile when a model is focused on inference rather than prediction.

Performance metrics and inference

The effectiveness of any given model depends on how the model will be used. An inferential model is used primarily to understand relationships, and typically is discussed with a strong focus on the choice (and validity) of probabilistic distribution and other generative qualities that define the model.

For a model used primarily for prediction, by constrast, predictive strength is primarily and concerns about underlying statistical qualities may be less important. Predictive strength is usually focused on how close our predictions come to the observed data,.. fidelity of the model predictions to the actual results. This chapter focuses on functions that can be used to measure predictive strength. However, our advice for those developing inferential models is to use these techniques even when the model will not be used with the primary goal of prediction

A longstanding issue with the practice of inferential statistics is that, with a focus purely on inference, it is difficult to assess the credibility of a model

One missing piece of information in this approach is how closely this model fits the actual data. Using resampling methods, discussed in Chapter 10, we can estimate the accuracy of this model to be about 73.3%. Accuracy is often a poor measure of model performance; we use it here because it is commonly understood. If the model has 73.3% fidelity to the data, should we trust the conclusions produced by the model? We might think so until we realize that the baseline rate of non-impaired patients in the data is 72.7%. This means that, despite our statistical analysis, the two-factor model appears to be only 0.6% better than a simple heuristic that always predicts patients to be unimpaired, irregardless of the observed data.

The point of this analysis is to demonstrate the idea that optimization of statistical characteristics of the model does not imply that the model fits the data well. Even for purely inferential models, some measure of fidelity to the data should accompany the inferential results. Using this, the consumers of the analyses can calibrate their expectations of the results of the statistical analysis.

Regression Metrics

Recall frin Sectuib 7.3 that tidymodels prediction functions produce tibbles with columns for the predicted values. These columns have consistent names, and the functions in the yardstick package that product performance metrics have consistent interfaces. The functions are data frame-based, as opposed to vector based, with the genral syntax of:

#functioN(data, truth,...)

library(tidymodels)

Where data is a data frame or tibble and truth is the column with observed outcome values. The elipses or other argumetns are used to specify the column(s) cotaining the predictions

To illustrate, let’s take the model. The lm_wflow_fit object was a linear regression model whose predictor set was supplenmented with an interaction and spline functions for longitude and latitude.

Although we do not advise using the test set at this juncture of the modeling process, it will be used to illustrate funcationality and syntax. The data frame ames_test consists of 731 properties. To start, let’s produce predictions:

ames_test_res <- predict(lm_fit, new_data = ames_test %>% select(-Sale_Price))

ames_test_res

## # A tibble: 731 x 1
##    .pred
##    <dbl>
##  1  5.31
##  2  5.30
##  3  5.17
##  4  5.52
##  5  5.09
##  6  5.49
##  7  5.51
##  8  5.43
##  9  5.55
## 10  5.24
## # ... with 721 more rows

The predicted numeric outcome from the regression model is named .pred. Let’s match the predicted values with their corresponding observed outcome values:

ames_test_res <- bind_cols(ames_test_res, ames_test %>% select(Sale_Price))
ames_test_res

## # A tibble: 731 x 2
##    .pred Sale_Price
##    <dbl>      <dbl>
##  1  5.31       5.39
##  2  5.30       5.28
##  3  5.17       5.27
##  4  5.52       5.60
##  5  5.09       5.02
##  6  5.49       5.49
##  7  5.51       5.60
##  8  5.43       5.34
##  9  5.55       5.51
## 10  5.24       5.30
## # ... with 721 more rows

Note that both the predicted and observed outcomes are in log10 units. It is best practice to analyze the predictions on the transformed scale (if one were used) even if the predictions are reported using the original units.

Let’s plot the data before computing metrics:

ggplot(ames_test_res, aes(x = Sale_Price, y = .pred)) + 
  # Create a diagonal line:
  geom_abline(lty = 2) + 
  geom_point(alpha = 0.5) + 
  labs(y = "Predicted Sale Price (log10)", x = "Sale Price (log10)") +
  # Scale and size the x- and y-axis uniformly:
  coord_obs_pred()

There is one property that is substantially over-predicted

Let’s compute the root mean squared error for this model using the rmse() function:

rmse(ames_test_res, truth = Sale_Price, estimate = .pred)

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      0.0808

The output above shows the standard format of the output of yardstick functions. Metrics for numeric outcomes usually have a value of “standard” for the .estimator column. Examples with different values for this column are shown below.

To compute multiple metrics at once, we can create a metric set. Let’s add R^2 and the mean absolute error:

ames_metrics <- metric_set(rmse, rsq, mae)

ames_metrics(ames_test_res, truth = Sale_Price, estimate = .pred)

## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      0.0808
## 2 rsq     standard      0.795 
## 3 mae     standard      0.0558

This tidy data format stacks the metrics vertically.

Binary classification metrics

To illustrate other ways to measure model performance, we will switch to a different example. The modeldata package contains example predictions from a test data set with two classes ('Class1' and ‘Class2’)

data(two_class_example)

str(two_class_example)

## 'data.frame':    500 obs. of  4 variables:
##  $ truth    : Factor w/ 2 levels "Class1","Class2": 2 1 2 1 2 1 1 1 2 2 ...
##  $ Class1   : num  0.00359 0.67862 0.11089 0.73516 0.01624 ...
##  $ Class2   : num  0.996 0.321 0.889 0.265 0.984 ...
##  $ predicted: Factor w/ 2 levels "Class1","Class2": 2 1 2 1 2 1 1 1 2 2 ...

The second and third columns are the predicted class probabilities for the test set while predicted are the discrete predictions.

For the hard class predictions, there are a variety of yeardstick functions that are helpful:

conf_mat(two_class_example, truth = truth, estimate = predicted)

##           Truth
## Prediction Class1 Class2
##     Class1    227     50
##     Class2     31    192

accuracy(two_class_example, truth = truth, estimate = predicted)

## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.838

mcc(two_class_example, truth, predicted)

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 mcc     binary         0.677

f_meas(two_class_example, truth, predicted)

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 f_meas  binary         0.849

For binary classification data sets, these functions have a standard argument called event_level. The default is that the first level of the outcome factor is teh event of interest.

There is some heterogeneity in R functions in this regard; some use the first level and others the second to denote the event of interest. We consider it more intuitive that the first level is the most important. The second level logic is borne of encoding the outcome as 0/1 (in which case the second value is the event) and unfortunately remains in some packages. However, tidymodels (along with many other R packages) require a categorical outcome to be encoded as a factor and, for this reason, the legacy justification for the second level as the event becomes irrelevant.

As a example where the second class is the event:

f_meas(two_class_example, truth = truth, predicted, event_level = 'second')

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 f_meas  binary         0.826

In the output, the .estimator value of ‘binary’ indicates that the standard formula for binary classes will be used.

There are numerous classification metrics that use the predicted probabilities as inputs rather than the hard class predictions.

For example, the receiver operating chaacteristic (ROC) curve computes the sensitivity and specificity over a continuum of different event thresholds.

There are two yardstick functions for this method: roc_curve() computes the data points that make up the ROC curve and roc_auc() computes the area under the curve.

The interfaces to these types of metric functions use the ... argument placeholder to pass in the appropriate class probability column. For two-class problems, the probability column for the event of interest is passed into the function

two_class_curve <- roc_curve(two_class_example, truth, Class1)

two_class_curve

## # A tibble: 502 x 3
##    .threshold specificity sensitivity
##         <dbl>       <dbl>       <dbl>
##  1 -Inf           0                 1
##  2    1.79e-7     0                 1
##  3    4.50e-6     0.00413           1
##  4    5.81e-6     0.00826           1
##  5    5.92e-6     0.0124            1
##  6    1.22e-5     0.0165            1
##  7    1.40e-5     0.0207            1
##  8    1.43e-5     0.0248            1
##  9    2.38e-5     0.0289            1
## 10    3.30e-5     0.0331            1
## # ... with 492 more rows

autoplot(two_class_curve)

There are a number of other functions that use probability estimates, including gain_curve(), lift_curve()and pr_curve().

Multi-Class Classification Metrics

What about data with three or more classes? To demonstrate, let’s explore a different example data set has four classes:

data(hpc_cv)
str(hpc_cv)

## 'data.frame':    3467 obs. of  7 variables:
##  $ obs     : Factor w/ 4 levels "VF","F","M","L": 1 1 1 1 1 1 1 1 1 1 ...
##  $ pred    : Factor w/ 4 levels "VF","F","M","L": 1 1 1 1 1 1 1 1 1 1 ...
##  $ VF      : num  0.914 0.938 0.947 0.929 0.942 ...
##  $ F       : num  0.0779 0.0571 0.0495 0.0653 0.0543 ...
##  $ M       : num  0.00848 0.00482 0.00316 0.00579 0.00381 ...
##  $ L       : num  1.99e-05 1.01e-05 5.00e-06 1.56e-05 7.29e-06 ...
##  $ Resample: chr  "Fold01" "Fold01" "Fold01" "Fold01" ...

As before, there are factors for the observed and predicted outcomes along with four other columns of predicted probabilities for each class. These data also include a Resample column. These results are not for out-of-sample predictions associated with 10-fold cross validation. For the time being, this column will be ignored

The functions for metrics that use the discrete class probabilities are identical:

accuracy(hpc_cv, obs, pred)

## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.709

mcc(hpc_cv, obs, pred)

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 mcc     multiclass     0.515

Note that, in these results, a ‘multiclass’ .estimator is listed. Like ‘binary’, this indicates the formulas for outcomes with three or more class levels was used.

There are methods for using metrics that are specific to outcomes with two classes for data sets with more than two classes. For example, a metric such as sensitivity measures the true positive rate which, by definition, is specific to two classes (i.e., “event” and “non-event”). How can this metric be used in our example data?

Macro-averaging computes a set of one-versus-all metrics using the standard two-class statistics. These are averaged.
Macro-weighted averaging does the same but the average is weighted by the number of samples in each class.
Micro-averaging computes the contribution for each class, aggregates them, then computes a single metric from the aggregates.

class_totals <- 
  count(hpc_cv, obs, name = "totals") %>% 
  mutate(class_wts = totals / sum(totals))
class_totals

##   obs totals  class_wts
## 1  VF   1769 0.51023940
## 2   F   1078 0.31093164
## 3   M    412 0.11883473
## 4   L    208 0.05999423

cell_counts <- 
  hpc_cv %>% 
  group_by(obs, pred) %>% 
  count() %>% 
  ungroup()

cell_counts

## # A tibble: 16 x 3
##    obs   pred      n
##    <fct> <fct> <int>
##  1 VF    VF     1620
##  2 VF    F       141
##  3 VF    M         6
##  4 VF    L         2
##  5 F     VF      371
##  6 F     F       647
##  7 F     M        24
##  8 F     L        36
##  9 M     VF       64
## 10 M     F       219
## 11 M     M        79
## 12 M     L        50
## 13 L     VF        9
## 14 L     F        60
## 15 L     M        28
## 16 L     L       111

# Compute the four sensitivities using 1-vs-all
one_versus_all <- 
  cell_counts %>% 
  filter(obs == pred) %>% 
  full_join(class_totals, by = "obs") %>% 
  mutate(sens = n / totals)

one_versus_all

## # A tibble: 4 x 6
##   obs   pred      n totals class_wts  sens
##   <fct> <fct> <int>  <int>     <dbl> <dbl>
## 1 VF    VF     1620   1769    0.510  0.916
## 2 F     F       647   1078    0.311  0.600
## 3 M     M        79    412    0.119  0.192
## 4 L     L       111    208    0.0600 0.534

# Three different estimates:
one_versus_all %>% 
  summarize(
    macro = mean(sens), 
    macro_wts = weighted.mean(sens, class_wts),
    micro = sum(n) / sum(totals)
  )

## # A tibble: 1 x 3
##   macro macro_wts micro
##   <dbl>     <dbl> <dbl>
## 1 0.560     0.709 0.709

one_versus_all

## # A tibble: 4 x 6
##   obs   pred      n totals class_wts  sens
##   <fct> <fct> <int>  <int>     <dbl> <dbl>
## 1 VF    VF     1620   1769    0.510  0.916
## 2 F     F       647   1078    0.311  0.600
## 3 M     M        79    412    0.119  0.192
## 4 L     L       111    208    0.0600 0.534

Thankfully, there are easier methods for obtaining these results:

sensitivity(hpc_cv, obs, pred, estimator = "macro")

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 sens    macro          0.560

sensitivity(hpc_cv, obs, pred, estimator = "macro_weighted")

## # A tibble: 1 x 3
##   .metric .estimator     .estimate
##   <chr>   <chr>              <dbl>
## 1 sens    macro_weighted     0.709

sensitivity(hpc_cv, obs, pred, estimator = "micro")

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 sens    micro          0.709

For metrics using probability estimates, there are some metrics with multi-class analogs. For example, Hand and Till (2001) determined a multi-class technique for ROC curves. In this case, all of the class probability columns must be given to the function:

roc_auc(hpc_cv, obs, VF, F, M, L)

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc hand_till      0.829

Macro-averaging is also available:

roc_auc(hpc_cv, obs, VF, F, M, L, estimator = "macro_weighted")

## # A tibble: 1 x 3
##   .metric .estimator     .estimate
##   <chr>   <chr>              <dbl>
## 1 roc_auc macro_weighted     0.868

Finally, all of these performance metrics can be computed using dplyr groupings. Recall that these data have a column for the resampling groups. Passing a grouped data frame to the metric function will compute the metrics for each group:

hpc_cv %>% 
  group_by(Resample) %>% 
  accuracy(obs, pred)

## # A tibble: 10 x 4
##    Resample .metric  .estimator .estimate
##    <chr>    <chr>    <chr>          <dbl>
##  1 Fold01   accuracy multiclass     0.726
##  2 Fold02   accuracy multiclass     0.712
##  3 Fold03   accuracy multiclass     0.758
##  4 Fold04   accuracy multiclass     0.712
##  5 Fold05   accuracy multiclass     0.712
##  6 Fold06   accuracy multiclass     0.697
##  7 Fold07   accuracy multiclass     0.675
##  8 Fold08   accuracy multiclass     0.721
##  9 Fold09   accuracy multiclass     0.673
## 10 Fold10   accuracy multiclass     0.699

The groupings also translate to the autoplot() methods:

# Four 1-vs-all ROC curves for each fold
hpc_cv %>% 
  group_by(Resample) %>% 
  roc_curve(obs, VF, F, M, L) %>% 
  autoplot()

This can be a quick visualization method for model effectiveness.

Chaptere summary

Functions from the yardstick package measure the effectiveness of a model using data. The primary interface is based on data frames (as opposed to having vector arguments). There are a variety of regression and classification metrics and, within these, there are sometimes different estimators for the statistics.