This demo outlines some typical steps in a ML analysis using tidymodels, with a dataset ames of houses in Ames, IA.
Let’s say we want to predict sale price using other house characteristics like:
We start with exploratory data analysis to learn things like:
Some models will need this kind of transformation explicitly.
ames %>%
ggplot(aes(Gr_Liv_Area)) +
geom_histogram() +
scale_x_log10()
The relationship (slope) between price and square footage is different for different kinds of homes.
## either want to specify interaction OR use a model that can learn interactions
ggplot(ames, aes(x = Gr_Liv_Area, y = Sale_Price)) +
geom_point(alpha = .2) +
facet_wrap(~ Bldg_Type) +
geom_smooth(method = lm) +
scale_x_log10()
Neighborhood is likely important for price, but there are too many neighborhoods for many kinds of models to handle gracefully!
library(leaflet)
neighborhood_pal <- colorFactor("viridis", ames$Neighborhood)
leaflet(ames) %>%
addTiles() %>%
addCircleMarkers(lng = ~Longitude, lat = ~Latitude,
color = ~neighborhood_pal(Neighborhood),
popup = ~Neighborhood)
The spatial distribution of price is somewhat complex. We will either need to build features to capture that relationship, or use a model that learns these relationships.
price_pal <- colorNumeric("viridis", ames$Sale_Price)
leaflet(ames) %>%
addTiles() %>%
addCircleMarkers(lng = ~Longitude, lat = ~Latitude,
color = ~price_pal(Sale_Price),
popup = ~Neighborhood) %>%
addLegend(pal = price_pal, value = ~Sale_Price)
Exploratory data analysis lays the foundation for good modeling decisions, and now we move into modeling proper. Some of the first steps are making choices around “spending” your data budget.
library(tidymodels)
set.seed(123)
ames_split <- initial_split(ames, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
Using resampled datasets, whether cross-validation, bootstrap, or other, is an important tool for reliable model development. In tidymodels, you can fluently switch between options, depending on your experimental goals/design, and not have to change other aspects of your ML analysis.
set.seed(234)
ames_folds <- vfold_cv(ames_train, strata = Sale_Price) ## can switch to bootstraps()
ames_folds
## # 10-fold cross-validation using stratification
## # A tibble: 10 x 2
## splits id
## <list> <chr>
## 1 <split [2K/221]> Fold01
## 2 <split [2K/221]> Fold02
## 3 <split [2K/221]> Fold03
## 4 <split [2K/221]> Fold04
## 5 <split [2K/221]> Fold05
## 6 <split [2K/220]> Fold06
## 7 <split [2K/219]> Fold07
## 8 <split [2K/219]> Fold08
## 9 <split [2K/219]> Fold09
## 10 <split [2K/217]> Fold10
One aspect of the design of tidymodels is to protect against data leakage during preprocessing, as well as during model fitting/tuning. We use the concept of a recipe, with “ingredients” and “steps” that you prepare and then apply to datasets.
Illustration by Allison Horst
This feature engineering recipe builds on what we learned during exploratory data analysis to create useful features, performing transformations like taking care of the high number of neighborhoods, building new spline features for latitude and longitude, accounting for the interaction between square footage and home type, etc.
ames_rec <-
recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
Latitude + Longitude, data = ames_train) %>%
step_log(Gr_Liv_Area, base = 10) %>%
step_other(Neighborhood, threshold = 0.01) %>%
step_dummy(all_nominal()) %>%
step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>%
step_ns(Latitude, Longitude, deg_free = 20)
ames_rec
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 6
##
## Operations:
##
## Log transformation on Gr_Liv_Area
## Collapsing factor levels for Neighborhood
## Dummy variables from all_nominal()
## Interactions with Gr_Liv_Area:starts_with("Bldg_Type_")
## Natural Splines on Latitude, Longitude
There are extensive options for feature engineering, including:
One of the goals of tidymodels is to provide a unified interface to many different kinds models for ML practitioners.
rf_spec <- rand_forest(trees = 1000) %>%
set_engine("ranger") %>%
set_mode("regression")
lm_spec <- linear_reg() %>%
set_engine("lm")
We go beyond that, though, and provide tools for fluent model workflows that support good statistical practice.
rf_wflow <- workflow() %>%
add_model(rf_spec) %>%
add_formula(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
Latitude + Longitude)
lm_wflow <- workflow() %>%
add_model(lm_spec) %>%
add_recipe(ames_rec)
We can fit one time to the training data only:
rf_wflow %>% fit(ames_train)
## == Workflow [trained] ================================================
## Preprocessor: Formula
## Model: rand_forest()
##
## -- Preprocessor ------------------------------------------------------
## Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
## Latitude + Longitude
##
## -- Model -------------------------------------------------------------
## Ranger result
##
## Call:
## ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~1000, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
##
## Type: Regression
## Number of trees: 1000
## Sample size: 2199
## Number of independent variables: 6
## Mtry: 2
## Target node size: 5
## Variable importance mode: none
## Splitrule: variance
## OOB prediction error (MSE): 0.004958666
## R squared (OOB): 0.8408472
But good statistical practice for comparing models is to use resample datasets like our cross-validation folds.
keep_pred <- control_resamples(save_pred = TRUE)
doParallel::registerDoParallel()
rf_res <-
rf_wflow %>%
fit_resamples(resamples = ames_folds, control = keep_pred)
lm_res <-
lm_wflow %>%
fit_resamples(resamples = ames_folds, control = keep_pred)
The tidymodels ecosystem uses tidyverse principles, which means that analysts and data scientists can use familiar data manipulation and visualization tools to, for example, explore and graph outputs.
bind_rows(
collect_metrics(rf_res),
collect_metrics(lm_res)
)
## # A tibble: 4 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 0.0699 10 0.00209 Preprocessor1_Model1
## 2 rsq standard 0.845 10 0.00887 Preprocessor1_Model1
## 3 rmse standard 0.0770 10 0.00254 Preprocessor1_Model1
## 4 rsq standard 0.810 10 0.0105 Preprocessor1_Model1
The random forest with no feature engineering performed somewhat better than the linear model with feature engineering.
bind_rows(
collect_predictions(rf_res) %>%
mutate(model = "random forest"),
collect_predictions(lm_res) %>%
mutate(model = "linear model"),
) %>%
ggplot(aes(Sale_Price, .pred, color = id)) +
geom_abline(slope = 1, lty = 2, color = "gray80") +
geom_point(alpha = 0.3, show.legend = FALSE) +
facet_wrap(~model)
The tidymodels framework has extensive support for hyperparameter tuning, for both feature engineering parameters and model parameters. Another feature in the tidymodels ecosystem is the ability to automatically generate ML code appropriate to a specific dataset.
library(usemodels)
use_xgboost(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
Latitude + Longitude, data = ames_train)
## many options available, like use_glmnet(), use_ranger(), etc
When a practitioner calls a function like use_xgboost(), they get code like the following, with feature engineering and modeling code specifically designed for their use case. It detects what kind of data they are using and chooses feature engineering steps based on this. It does not make all decisions for the practitioner, leaving important choices about the resampling strategy and parameter grid (i.e. experimental design choices) to the user.
xgboost_recipe <-
recipe(formula = Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
Latitude + Longitude, data = ames_train) %>%
step_novel(all_nominal(), -all_outcomes()) %>%
step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
step_zv(all_predictors())
xgboost_spec <-
boost_tree(trees = tune(), min_n = tune(), tree_depth = tune(), learn_rate = 0.01,
loss_reduction = tune(), sample_size = tune()) %>%
set_mode("regression") %>%
set_engine("xgboost")
xgboost_workflow <-
workflow() %>%
add_recipe(xgboost_recipe) %>%
add_model(xgboost_spec)
set.seed(93056)
xgboost_tune <-
tune_grid(xgboost_workflow, resamples = ames_folds, grid = 7)
We have extensive support for automatic visualization methods, as well as the ability to customize all plots.
autoplot(xgboost_tune)
As it typical with xgboost, there are many combinations of parameters that give good results. Practitioners can find the best performing models.
show_best(xgboost_tune, metric = "rmse")
## # A tibble: 5 x 11
## trees min_n tree_depth loss_reduction sample_size .metric .estimator
## <int> <int> <int> <dbl> <dbl> <chr> <chr>
## 1 1899 36 7 0.0126 0.698 rmse standard
## 2 698 25 10 0.00000197 0.992 rmse standard
## 3 1480 32 14 0.000000000470 0.320 rmse standard
## 4 1348 19 3 0.000137 0.422 rmse standard
## 5 884 9 2 0.0305 0.160 rmse standard
## # ... with 4 more variables: mean <dbl>, n <int>, std_err <dbl>,
## # .config <chr>
These model parameters can be used to update/finalize the original workflow, which can then be deployed via an API or other method.
best_params <- select_best(xgboost_tune)
xgb_fitted <- finalize_workflow(xgboost_workflow, best_params) %>%
last_fit(ames_split)
xgb_fitted
## # Resampling results
## # Manual resampling
## # A tibble: 1 x 6
## splits id .metrics .notes .predictions .workflow
## <list> <chr> <list> <list> <list> <list>
## 1 <split [2.~ train/te~ <tibble [2~ <tibble ~ <tibble [731 ~ <workflo~
To see examples of using tidymodels for modeling and ML, check out these screencasts.