Build a model

Introduction
The Sea Urchins Data
Build and fit a model
Use a model to predict
Model with a different engine

Get started with building a model in this R Markdown document that accompanies Build a model tidymodels start article.

If you ever get lost, you can visit the links provided next to section headers to see the accompanying section in the online article.

Take advantage of the RStudio IDE and use “Run All Chunks Above” or “Run Current Chunk” buttons to easily execute code chunks. If you have been running other tidymodels articles in this project, restart R before working on this article so you don’t run out of memory on RStudio Cloud.

Introduction

Load necessary packages:

library(tidymodels)  # for the parsnip package, along with the rest of tidymodels

# Helper packages
library(readr)       # for importing data
library(broom.mixed) # for converting bayesian models to tidy tibbles

The Sea Urchins Data

urchins <-
  # Data were assembled for a tutorial 
  # at https://www.flutterbys.com.au/stats/tut/tut7.5a.html
  read_csv("https://tidymodels.org/start/models/urchins.csv") %>% 
  # Change the names to be a little more verbose
  setNames(c("food_regime", "initial_volume", "width")) %>% 
  # Factors are very helpful for modeling, so we convert one column
  mutate(food_regime = factor(food_regime, levels = c("Initial", "Low", "High")))

Look at the data:

urchins

## # A tibble: 72 x 3
##   food_regime initial_volume width
##   <fct>                <dbl> <dbl>
## 1 Initial                3.5 0.01 
## 2 Initial                5   0.02 
## 3 Initial                8   0.061
## 4 Initial               10   0.051
## 5 Initial               13   0.041
## # … with 67 more rows

Plot the data:

ggplot(urchins,
       aes(x = initial_volume, 
           y = width, 
           group = food_regime, 
           col = food_regime)) + 
  geom_point() + 
  geom_smooth(method = lm, se = FALSE) +
  scale_color_viridis_d(option = "plasma", end = .7)

Build and fit a model

linear_reg() %>% 
  set_engine("lm")

## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

Try typing ?linear_reg() in the console to see all available engines and other details about this model type.

Create model specification:

lm_mod <- 
  linear_reg() %>% 
  set_engine("lm")

Fit model:

lm_fit <- 
  lm_mod %>% 
  fit(width ~ initial_volume * food_regime, data = urchins)
lm_fit

## parsnip model object
## 
## Fit time:  2ms 
## 
## Call:
## stats::lm(formula = width ~ initial_volume * food_regime, data = data)
## 
## Coefficients:
##                    (Intercept)                  initial_volume  
##                      0.0331216                       0.0015546  
##                 food_regimeLow                 food_regimeHigh  
##                      0.0197824                       0.0214111  
##  initial_volume:food_regimeLow  initial_volume:food_regimeHigh  
##                     -0.0012594                       0.0005254

Present model results in a tidyverse friendly way with tidy() from broom package.

tidy(lm_fit)

## # A tibble: 6 x 5
##   term                            estimate std.error statistic  p.value
##   <chr>                              <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)                     0.0331    0.00962      3.44  0.00100 
## 2 initial_volume                  0.00155   0.000398     3.91  0.000222
## 3 food_regimeLow                  0.0198    0.0130       1.52  0.133   
## 4 food_regimeHigh                 0.0214    0.0145       1.47  0.145   
## 5 initial_volume:food_regimeLow  -0.00126   0.000510    -2.47  0.0162  
## 6 initial_volume:food_regimeHigh  0.000525  0.000702     0.748 0.457

Use a model to predict

New example data to predict:

new_points <- expand.grid(initial_volume = 20, 
                          food_regime = c("Initial", "Low", "High"))
new_points

##   initial_volume food_regime
## 1             20     Initial
## 2             20         Low
## 3             20        High

Generate the mean body width values:

mean_pred <- predict(lm_fit, new_data = new_points)
mean_pred

## # A tibble: 3 x 1
##    .pred
##    <dbl>
## 1 0.0642
## 2 0.0588
## 3 0.0961

Get confidence intervals and plot:

conf_int_pred <- predict(lm_fit, 
                         new_data = new_points, 
                         type = "conf_int")
conf_int_pred

## # A tibble: 3 x 2
##   .pred_lower .pred_upper
##         <dbl>       <dbl>
## 1      0.0555      0.0729
## 2      0.0499      0.0678
## 3      0.0870      0.105

# Now combine: 
plot_data <- 
  new_points %>% 
  bind_cols(mean_pred) %>% 
  bind_cols(conf_int_pred)

# and plot:
ggplot(plot_data, aes(x = food_regime)) + 
  geom_point(aes(y = .pred)) + 
  geom_errorbar(aes(ymin = .pred_lower, 
                    ymax = .pred_upper),
                width = .2) + 
  labs(y = "urchin size")

Model with a different engine

Switch to Bayesian approach by simply changing your engine to stan:

# set the prior distribution
prior_dist <- rstanarm::student_t(df = 1)

set.seed(123)

# make the parsnip model
bayes_mod <-   
  linear_reg() %>% 
  set_engine("stan", 
             prior_intercept = prior_dist, 
             prior = prior_dist) 

# train the model
bayes_fit <- 
  bayes_mod %>% 
  fit(width ~ initial_volume * food_regime, data = urchins)

print(bayes_fit, digits = 5)

## parsnip model object
## 
## Fit time:  11.1s 
## stan_glm
##  family:       gaussian [identity]
##  formula:      width ~ initial_volume * food_regime
##  observations: 72
##  predictors:   6
## ------
##                                Median   MAD_SD  
## (Intercept)                     0.03286  0.00956
## initial_volume                  0.00156  0.00041
## food_regimeLow                  0.01968  0.01267
## food_regimeHigh                 0.02160  0.01425
## initial_volume:food_regimeLow  -0.00125  0.00050
## initial_volume:food_regimeHigh  0.00053  0.00071
## 
## Auxiliary parameter(s):
##       Median  MAD_SD 
## sigma 0.02122 0.00185
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

To update the parameter table, the tidy() method is once again used:

tidy(bayes_fit, intervals = TRUE)

## # A tibble: 6 x 3
##   term                            estimate std.error
##   <chr>                              <dbl>     <dbl>
## 1 (Intercept)                     0.0329    0.00956 
## 2 initial_volume                  0.00156   0.000408
## 3 food_regimeLow                  0.0197    0.0127  
## 4 food_regimeHigh                 0.0216    0.0143  
## 5 initial_volume:food_regimeLow  -0.00125   0.000499
## 6 initial_volume:food_regimeHigh  0.000527  0.000707

Get your predictions without changing the syntax you used earlier:

bayes_plot_data <- 
  new_points %>% 
  bind_cols(predict(bayes_fit, new_data = new_points)) %>% 
  bind_cols(predict(bayes_fit, new_data = new_points, type = "conf_int"))

ggplot(bayes_plot_data, aes(x = food_regime)) + 
  geom_point(aes(y = .pred)) + 
  geom_errorbar(aes(ymin = .pred_lower, ymax = .pred_upper), width = .2) + 
  labs(y = "urchin size") + 
  ggtitle("Bayesian model with t(1) prior distribution")

Think about how we are using the pipe (%>%):

Use the pipe to pass around the data in the tidyverse
Use the pipe to pass around the model object with tidymodels