Fitting models with parsnip

The parsnip package provides a fluent and standardized interface for a variety of different models. In this chapter, we both give some motiavation for why a common interface is beneficial and show how the use the package.

Create a model

Once the data have been encoded in a format ready for a modeling algorithm, such as numeric matrix, they can be used in the model building process.

Suppose we will use the linear regression. There are a variety of methods that can be used to estimate the model parameters:

library(tidymodels)
## -- Attaching packages --------
## v broom     0.7.0      v recipes   0.1.13
## v dials     0.0.8      v rsample   0.0.7 
## v dplyr     1.0.0      v tibble    3.0.3 
## v ggplot2   3.3.2      v tidyr     1.1.0 
## v infer     0.5.3      v tune      0.1.1 
## v modeldata 0.0.2      v workflows 0.1.2 
## v parsnip   0.1.2      v yardstick 0.0.7 
## v purrr     0.3.4
## -- Conflicts -----------------
## x purrr::discard() masks scales::discard()
## x dplyr::filter()  masks stats::filter()
## x dplyr::lag()     masks stats::lag()
## x recipes::step()  masks stats::step()
setwd('C:/Users/DellPC/Desktop/Corner/R_source_code/Julia_Silge/tidy_model_R_book')

ames <- read.csv('ames.csv')

ames_split <- initial_split(ames, prob= 0.80)
ames_split
## <Analysis/Assess/Total>
## <2198/732/2930>
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
library(tidymodels)
library(tidyverse)
## -- Attaching packages --------
## v readr   1.3.1     v forcats 0.5.0
## v stringr 1.4.0
## -- Conflicts -----------------
## x readr::col_factor() masks scales::col_factor()
## x purrr::discard()    masks scales::discard()
## x dplyr::filter()     masks stats::filter()
## x stringr::fixed()    masks recipes::fixed()
## x dplyr::lag()        masks stats::lag()
## x readr::spec()       masks yardstick::spec()
#model <- lm(formula, data)

where .. symbolizes other options to pass to lm(). The function does not have an x/y interface, where we might pass in our outcome as y and our predictors as x

To estimate with regularization, a Bayensian model can be fit using the rstanarm package:

#model <- stan_glm(formula, data, family = 'gaussian')

In this case, the other options passess via ... would include arguments for the prior distributions of the parameters as well as specifics about the numerical aspects of the model. As with lm(), only the formula interface is available.

A popular non-Bayesian approach to regularized regression is the glmnet model.

#model <- glmnet(x=matrix, y = vector, family = 'gausian')

For tidymodels, the approach to specifying a model í interded to be more unified:

1. Specify the type of model based on its mathematical structure (e.g: linear regression, random forest K-nearest neighbors, etc)

2. Specify the engine for fitting the model. Most often this reflects the software package that should be used.

3. When required, declare the mode of the model The mode reflects the type of prediction outcome. FOr numeric outcomes, the mode is regression. for qualitative outcomes, it is classification. If a model can only create one type of model, such as linear regression, the mode is already set.

These specifications are built without referencing the data. For example, for the three cases above:

linear_reg() %>% set_engine('lm')
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
linear_reg() %>% set_engine('glmnet')
## Linear Regression Model Specification (regression)
## 
## Computational engine: glmnet
linear_reg() %>% set_engine('stan')
## Linear Regression Model Specification (regression)
## 
## Computational engine: stan

Once the details of the model have been specified, the model estimation can be done with either the fit() function (to use formula) or the fit_xy() function (when your data are already pre-processed). The parsnip** package allows the user to be indifferent to the interface of the underlying model; you can always use a formula even if the modeling package’s function ony has the x/y interface.

The translate() function can provide details on how parsnip converts the user’s code to the package’s syntax.

linear_reg() %>% set_engine('lm') %>% translate()
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm 
## 
## Model fit template:
## stats::lm(formula = missing_arg(), data = missing_arg(), weights = missing_arg())
linear_reg() %>% set_engine('glmnet') %>% translate()
## Linear Regression Model Specification (regression)
## 
## Computational engine: glmnet 
## 
## Model fit template:
## glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(), 
##     family = "gaussian")
linear_reg() %>% set_engine('stan') %>% translate()
## Linear Regression Model Specification (regression)
## 
## Computational engine: stan 
## 
## Model fit template:
## rstanarm::stan_glm(formula = missing_arg(), data = missing_arg(), 
##     weights = missing_arg(), family = stats::gaussian, refresh = 0)

Note that missing_arg() is just a placeholder for the data has yet to be provided.

Note that, for the Stand and glmnet engines` the family argument was automatically added as a default. As will be shown below, this option can be charged.

Let’s walk through how to predict the sale price of houses in the Ames data as a function of only longitude and latitude. This code builds on the current analysis shown in Section 6.11:

lm_model <- linear_reg() %>%
               set_engine('lm')


lm_form_fit <- lm_model %>%
               #Recall that Sale_Price has been pre-logged
               fit(Sale_Price ~ Longitude + Latitude, data = ames_train)


lm_xy_fit <- lm_model  %>%
               fit_xy(
                              x = ames_train %>% select(Longitude, Latitude),
                              y = ames_train %>% pull(Sale_Price)
               )
lm_form_fit
## parsnip model object
## 
## Fit time:  0ms 
## 
## Call:
## stats::lm(formula = Sale_Price ~ Longitude + Latitude, data = data)
## 
## Coefficients:
## (Intercept)    Longitude     Latitude  
##    -310.312       -2.121        2.781
lm_xy_fit
## parsnip model object
## 
## Fit time:  0ms 
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
## (Intercept)    Longitude     Latitude  
##    -310.312       -2.121        2.781

The differences are between fit() and fit_xy() may not be obvious.

When fit() is used with a model specification, this almost always means that dummy variables will be created from qualitative predictors. If the underlying function requires a matrix (like glmnet), it will make them. However, if the underlying function uses a formula, fit() just passes the formula to that function. We estimate that 99% of modeling functions using formulas make dummy variables. The other 1% include tree-based methods that do not require purely numeric predictors.

The fit_xy() function always passess the data as is to the underlying model function. It will not create dummy variables before doing so.

To understand how the parsnip argument names map to the original names, use the help file for the model (available via ?rand_forest) as well as the translate() function:

rand_forest(trees = 1000, min_n = 5) %>%
               set_engine('ranger') %>% 
               set_mode('regression') %>%
               translate()
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   trees = 1000
##   min_n = 5
## 
## Computational engine: ranger 
## 
## Model fit template:
## ranger::ranger(formula = missing_arg(), data = missing_arg(), 
##     case.weights = missing_arg(), num.trees = 1000, min.node.size = 5, 
##     num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 
##         1))

Modeling functions in parsnip separate model arguments into two categories

For example, in the translateion of the random forest code above, the argumetns num.threads verbose and seed were added by default. These argumetns are specific to the ranger implementation of random forest models and wouldn;t make sense as main arguments. Engine-specific argumetns can be specified in set_engine(). For example, top have the ranger::ranger() function print out more information about the fit:

rand_forest(trees = 1000, min_n = 5) %>%
               set_engine('ranger', verbose = TRUE) %>% set_mode('regression')
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   trees = 1000
##   min_n = 5
## 
## Engine-Specific Arguments:
##   verbose = TRUE
## 
## Computational engine: ranger

Use the model results

Once the model is created and fit, we can use the results in a variety of ways; we might want to plot, print, or otherwise examine the model output. Several quantities are stored in a parsnip model object, including the fitted model. This can be found in an element called fit, which can be returned using the purrr::pluck() function:

lm_form_fit %>% pluck('fit')
## 
## Call:
## stats::lm(formula = Sale_Price ~ Longitude + Latitude, data = data)
## 
## Coefficients:
## (Intercept)    Longitude     Latitude  
##    -310.312       -2.121        2.781

Normal methods can be applied to this object, such as printing, plotting, and so on:

lm_form_fit %>% pluck('fit') %>% vcov()
##             (Intercept)     Longitude      Latitude
## (Intercept)  226.259812  1.7208485659 -1.5490508315
## Longitude      1.720849  0.0180036557 -0.0008309154
## Latitude      -1.549051 -0.0008309154  0.0350006256

One issue with some existing methods in base R is that the results are stored in a manner that may not be the most useful. For example, the summary() method for lm objects can be used used to print the results of the model fit, including a table with parameter values, their uncertainty estimates, and p-values. These particular results can also be saved:

model_res <- lm_form_fit %>% 
               pluck('fit') %>%
               summary()

# The model coefficient table is accessible via the `coef` method.
# 
param_est <- coef(model_res)

class(param_est)
## [1] "matrix" "array"
param_est
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -310.312080 15.0419351 -20.62980 1.433929e-86
## Longitude     -2.121388  0.1341777 -15.81029 2.088164e-53
## Latitude       2.780582  0.1870845  14.86270 1.106414e-47
tidy(lm_form_fit)
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  -310.      15.0       -20.6 1.43e-86
## 2 Longitude      -2.12     0.134     -15.8 2.09e-53
## 3 Latitude        2.78     0.187      14.9 1.11e-47

Make Predictions

Another area where parsnip diverges from conventional R modeling function is the format of values returned from predict(). For predictions, parsnip always conforms to the following rules:

  1. The results are always a tibble
  2. The column names of the tibble are alwasy predictable
  3. There are always as many rows in the tibble as there are in the input data set.

For example, when numeric data are predicted:

ames_test_small <- ames_test %>% slice(1:5)

predict(lm_form_fit, new_data = ames_test_small)
## # A tibble: 5 x 1
##   .pred
##   <dbl>
## 1  5.22
## 2  5.21
## 3  5.29
## 4  5.27
## 5  5.27
ames_test_small %>% 
  select(Sale_Price) %>% 
  bind_cols(predict(lm_form_fit, ames_test_small)) %>% 
  # Add 95% prediction intervals to the results:
  bind_cols(predict(lm_form_fit, ames_test_small, type = "pred_int")) 
##   Sale_Price    .pred .pred_lower .pred_upper
## 1   5.235528 5.221863    4.904018    5.539709
## 2   5.387390 5.213546    4.895700    5.531392
## 3   5.291147 5.285889    4.968025    5.603753
## 4   5.282169 5.274930    4.957060    5.592800
## 5   5.267172 5.270237    4.952404    5.588070

Chapter Summary

library(tidymodels)
data(ames)
ames <- mutate(ames, Sale_Price = log10(Sale_Price))

set.seed(123)
ames_split <- initial_split(ames, prob = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <-  testing(ames_split)

ames_rec <- 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + 
           Latitude + Longitude, data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_other(Neighborhood, threshold = 0.01) %>% 
  step_dummy(all_nominal()) %>% 
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>% 
  step_ns(Latitude, Longitude, deg_free = 20)

lm_model <- linear_reg() %>% set_engine("lm")