This demo outlines some typical steps in an ML analysis with tidymodels, using the ames dataset of houses in Ames, IA.

Exploratory data analysis

Let’s say we want to predict sale price using other house characteristics like:

  • neighborhood
  • living area of the house
  • year built
  • building type
  • latitude
  • longitude

We start with exploratory data analysis to learn things like:

Distribution of living area is log normal

Some models will need us to apply this kind of transformation explicitly during preprocessing.

library(tidymodels)                 ## attaches ggplot2, dplyr, and friends
data(ames, package = "modeldata")   ## the Ames housing data

ames %>%
  ggplot(aes(Gr_Liv_Area)) +
  geom_histogram() +
  scale_x_log10()

Interaction between living area and type of home

The relationship (slope) between price and square footage is different for different kinds of homes.

## either want to specify interaction OR use a model that can learn interactions
ggplot(ames, aes(x = Gr_Liv_Area, y = Sale_Price)) + 
  geom_point(alpha = .2) + 
  facet_wrap(~ Bldg_Type) + 
  geom_smooth(method = lm) + 
  scale_x_log10()

Distribution of neighborhoods

Neighborhood is likely important for price, but there are too many neighborhoods for many kinds of models to handle gracefully!

library(leaflet)
neighborhood_pal <- colorFactor("viridis", ames$Neighborhood)

leaflet(ames) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~Longitude, lat = ~Latitude, 
                   color = ~neighborhood_pal(Neighborhood),
                   popup = ~Neighborhood)
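A quick count shows just how many neighborhoods there are and how unevenly the houses are spread across them; a minimal dplyr sketch:

## how many houses fall in each neighborhood?
ames %>%
  count(Neighborhood, sort = TRUE)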

Spatial distribution of price

The spatial distribution of price is somewhat complex. We will either need to build features to capture that relationship, or use a model that learns these relationships.

price_pal <- colorNumeric("viridis", ames$Sale_Price)

leaflet(ames) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~Longitude, lat = ~Latitude, 
                   color = ~price_pal(Sale_Price),
                   popup = ~Neighborhood) %>%
  addLegend(pal = price_pal, values = ~Sale_Price)

Spending our data budget

Exploratory data analysis lays the foundation for good modeling decisions, and now we move into modeling proper. Some of the first steps are making choices around “spending” your data budget.

library(tidymodels)

set.seed(123)
ames_split <- initial_split(ames, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

Using resampled datasets, whether created via cross-validation, the bootstrap, or another scheme, is an important tool for reliable model development. In tidymodels, you can fluently switch between these options depending on your experimental goals and design, without having to change other aspects of your ML analysis.

set.seed(234)
ames_folds <- vfold_cv(ames_train, strata = Sale_Price) ## can switch to bootstraps()
ames_folds
## #  10-fold cross-validation using stratification 
## # A tibble: 10 x 2
##    splits           id    
##    <list>           <chr> 
##  1 <split [2K/221]> Fold01
##  2 <split [2K/221]> Fold02
##  3 <split [2K/221]> Fold03
##  4 <split [2K/221]> Fold04
##  5 <split [2K/221]> Fold05
##  6 <split [2K/220]> Fold06
##  7 <split [2K/219]> Fold07
##  8 <split [2K/219]> Fold08
##  9 <split [2K/219]> Fold09
## 10 <split [2K/217]> Fold10
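For example, switching to bootstrap resampling only changes how the resamples object is created; everything downstream (fitting, tuning, evaluating) accepts it in exactly the same way. A minimal sketch, where the seed and times = 25 are arbitrary choices:

set.seed(345)
ames_boots <- bootstraps(ames_train, times = 25, strata = Sale_Price)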

Feature engineering

One aspect of the design of tidymodels is to protect against data leakage during preprocessing, as well as during model fitting/tuning. We use the concept of a recipe, with “ingredients” and “steps” that you prepare and then apply to datasets.

(Figure: illustration by Allison Horst)

This feature engineering recipe builds on what we learned during exploratory data analysis to create useful features: it log-transforms the living area, collapses the large number of neighborhoods into fewer levels, creates dummy variables, adds an interaction between square footage and building type, and builds spline features for latitude and longitude.

ames_rec <- 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + 
           Latitude + Longitude, data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_other(Neighborhood, threshold = 0.01) %>% 
  step_dummy(all_nominal()) %>% 
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>% 
  step_ns(Latitude, Longitude, deg_free = 20)

ames_rec
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          6
## 
## Operations:
## 
## Log transformation on Gr_Liv_Area
## Collapsing factor levels for Neighborhood
## Dummy variables from all_nominal()
## Interactions with Gr_Liv_Area:starts_with("Bldg_Type_")
## Natural Splines on Latitude, Longitude
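The recipe itself is only a specification. The “prepare, then apply” idea can be seen directly: prep() estimates anything the steps need (for example, which neighborhoods to collapse) from the training data, and bake() applies the trained recipe to any dataset. A minimal sketch; when a recipe is used inside a workflow, as below, these calls happen automatically:

ames_prep <- prep(ames_rec, training = ames_train)  ## estimate step parameters on the training set
bake(ames_prep, new_data = ames_test)               ## apply the trained recipe to new data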

There are extensive options for feature engineering (a brief sketch follows this list), including:

  • feature extraction like PCA
  • removing correlated predictors
  • even feature engineering for text and NLP
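A minimal sketch of what the first two options look like as recipe steps; the threshold and number of components are arbitrary choices, and text preprocessing steps such as step_tokenize() and step_tfidf() come from the separate textrecipes package:

recipe(Sale_Price ~ ., data = ames_train) %>%
  step_corr(all_numeric(), -all_outcomes(), threshold = 0.9) %>%   ## drop highly correlated predictors
  step_normalize(all_numeric(), -all_outcomes()) %>%               ## center and scale before PCA
  step_pca(all_numeric(), -all_outcomes(), num_comp = 5)           ## feature extraction via PCA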

Model fitting

One of the goals of tidymodels is to provide a unified interface to many different kinds of models for ML practitioners.

rf_spec <- rand_forest(trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("regression") 

lm_spec <- linear_reg() %>%
  set_engine("lm")

We go beyond that, though, and provide tools for fluent model workflows that support good statistical practice.

rf_wflow <- workflow() %>%
  add_model(rf_spec) %>%
  add_formula(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + 
                Latitude + Longitude)


lm_wflow <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(ames_rec)

We can fit a workflow one time to the training data only:

rf_wflow %>% fit(ames_train)
## == Workflow [trained] ================================================
## Preprocessor: Formula
## Model: rand_forest()
## 
## -- Preprocessor ------------------------------------------------------
## Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + 
##     Latitude + Longitude
## 
## -- Model -------------------------------------------------------------
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~1000,      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1)) 
## 
## Type:                             Regression 
## Number of trees:                  1000 
## Sample size:                      2199 
## Number of independent variables:  6 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.004958666 
## R squared (OOB):                  0.8408472

But good statistical practice for comparing models is to use resampled datasets like our cross-validation folds.

keep_pred <- control_resamples(save_pred = TRUE)
doParallel::registerDoParallel()

rf_res <- 
  rf_wflow %>% 
  fit_resamples(resamples = ames_folds, control = keep_pred)

lm_res <- 
  lm_wflow %>% 
  fit_resamples(resamples = ames_folds, control = keep_pred)

Model evaluation

The tidymodels ecosystem uses tidyverse principles, which means that analysts and data scientists can use familiar data manipulation and visualization tools to, for example, explore and graph outputs.

bind_rows(
  collect_metrics(rf_res),
  collect_metrics(lm_res)
)
## # A tibble: 4 x 6
##   .metric .estimator   mean     n std_err .config             
##   <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   0.0699    10 0.00209 Preprocessor1_Model1
## 2 rsq     standard   0.845     10 0.00887 Preprocessor1_Model1
## 3 rmse    standard   0.0770    10 0.00254 Preprocessor1_Model1
## 4 rsq     standard   0.810     10 0.0105  Preprocessor1_Model1

The random forest with no feature engineering (first two rows) performed somewhat better than the linear model with feature engineering (last two rows).

bind_rows(
  collect_predictions(rf_res) %>%
    mutate(model = "random forest"),
  collect_predictions(lm_res) %>%
    mutate(model = "linear model"),
) %>%
  ggplot(aes(Sale_Price, .pred, color = id)) +
  geom_abline(slope = 1, lty = 2, color = "gray80") +
  geom_point(alpha = 0.3, show.legend = FALSE) +
  facet_wrap(~model)

Hyperparameter tuning

The tidymodels framework has extensive support for hyperparameter tuning, for both feature engineering parameters and model parameters. Another feature in the tidymodels ecosystem is the ability to automatically generate ML code appropriate to a specific dataset.

library(usemodels)

use_xgboost(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + 
              Latitude + Longitude, data = ames_train)
## many options available, like use_glmnet(), use_ranger(), etc

When a practitioner calls a function like use_xgboost(), they get scaffolding code like the following, with feature engineering and modeling code tailored to their use case: the function detects what kind of data they are working with and chooses feature engineering steps accordingly. It does not make every decision for the practitioner; important choices about the resampling strategy and parameter grid (i.e. experimental design) are left to the user.

xgboost_recipe <- 
  recipe(formula = Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + 
           Latitude + Longitude, data = ames_train) %>% 
  step_novel(all_nominal(), -all_outcomes()) %>% 
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>% 
  step_zv(all_predictors()) 

xgboost_spec <- 
  boost_tree(trees = tune(), min_n = tune(), tree_depth = tune(), learn_rate = 0.01, 
             loss_reduction = tune(), sample_size = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("xgboost") 

xgboost_workflow <- 
  workflow() %>% 
  add_recipe(xgboost_recipe) %>% 
  add_model(xgboost_spec) 

set.seed(93056)
xgboost_tune <-
  tune_grid(xgboost_workflow, resamples = ames_folds, grid = 7)

We have extensive support for automatic visualization methods, as well as the ability to customize all plots.

autoplot(xgboost_tune)
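Because tuning results are stored as ordinary tibbles, plots can also be built by hand with ggplot2. A minimal sketch, plotting resampled RMSE against min_n (an arbitrary choice of parameter):

collect_metrics(xgboost_tune) %>%
  filter(.metric == "rmse") %>%
  ggplot(aes(min_n, mean)) +
  geom_point() +
  labs(y = "RMSE (resampled)")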

As is typical with xgboost, there are many combinations of parameters that give good results. Practitioners can then identify the best-performing models.

show_best(xgboost_tune, metric = "rmse")
## # A tibble: 5 x 11
##   trees min_n tree_depth loss_reduction sample_size .metric .estimator
##   <int> <int>      <int>          <dbl>       <dbl> <chr>   <chr>     
## 1  1899    36          7 0.0126               0.698 rmse    standard  
## 2   698    25         10 0.00000197           0.992 rmse    standard  
## 3  1480    32         14 0.000000000470       0.320 rmse    standard  
## 4  1348    19          3 0.000137             0.422 rmse    standard  
## 5   884     9          2 0.0305               0.160 rmse    standard  
## # ... with 4 more variables: mean <dbl>, n <int>, std_err <dbl>,
## #   .config <chr>

These model parameters can be used to update/finalize the original workflow; last_fit() then fits the finalized workflow one last time to the training set and evaluates it on the held-out test set. The result can then be deployed via an API or other method.

best_params <- select_best(xgboost_tune, metric = "rmse")

xgb_fitted <- finalize_workflow(xgboost_workflow, best_params) %>%
  last_fit(ames_split)

xgb_fitted
## # Resampling results
## # Manual resampling 
## # A tibble: 1 x 6
##   splits      id        .metrics    .notes    .predictions   .workflow
##   <list>      <chr>     <list>      <list>    <list>         <list>   
## 1 <split [2.~ train/te~ <tibble [2~ <tibble ~ <tibble [731 ~ <workflo~
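The .metrics and .workflow columns hold the test-set performance and the final fitted workflow. A minimal sketch of using them; predicting on the first few test rows is just for illustration:

collect_metrics(xgb_fitted)              ## metrics computed on the held-out test set

final_wf <- xgb_fitted$.workflow[[1]]    ## the fitted workflow, ready to predict or deploy
predict(final_wf, new_data = ames_test %>% slice(1:5))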

To see examples of using tidymodels for modeling and ML, check out these screencasts.