setwd("D:/Class Materials & Work/Summer 2020 practice/1_Tidymodel_Model_building")

We will specify and train models using the parsnip package from tidymodels.
Load the required packages:

library(tidymodels)  # for the parsnip package, along with the rest of tidymodels
## -- Attaching packages -------------------------------------------------------------------------------- tidymodels 0.1.0 --
## v broom     0.5.6      v recipes   0.1.12
## v dials     0.0.6      v rsample   0.0.6 
## v dplyr     0.8.5      v tibble    3.0.1 
## v ggplot2   3.3.1      v tune      0.1.0 
## v infer     0.5.1      v workflows 0.1.1 
## v parsnip   0.1.1      v yardstick 0.0.6 
## v purrr     0.3.4
## -- Conflicts ----------------------------------------------------------------------------------- tidymodels_conflicts() --
## x purrr::discard()  masks scales::discard()
## x dplyr::filter()   masks stats::filter()
## x dplyr::lag()      masks stats::lag()
## x ggplot2::margin() masks dials::margin()
## x recipes::step()   masks stats::step()

Helper packages

library(readr)       # for importing data
## 
## Attaching package: 'readr'
## The following object is masked from 'package:yardstick':
## 
##     spec
## The following object is masked from 'package:scales':
## 
##     col_factor
library(broom.mixed) # for converting bayesian models to tidy tibbles
## Warning: package 'broom.mixed' was built under R version 4.0.2
## Registered S3 methods overwritten by 'broom.mixed':
##   method         from 
##   augment.lme    broom
##   augment.merMod broom
##   glance.lme     broom
##   glance.merMod  broom
##   glance.stanreg broom
##   tidy.brmsfit   broom
##   tidy.gamlss    broom
##   tidy.lme       broom
##   tidy.merMod    broom
##   tidy.rjags     broom
##   tidy.stanfit   broom
##   tidy.stanreg   broom
## 
## Attaching package: 'broom.mixed'
## The following object is masked from 'package:broom':
## 
##     tidyMCMC

Read the sea urchin data to examine how three different feeding regimes affect the size of sea urchins over time.

urchins <-
  # Data were assembled for a tutorial 
  # at https://www.flutterbys.com.au/stats/tut/tut7.5a.html
  read_csv("https://tidymodels.org/start/models/urchins.csv") %>% 
  # Change the names to be a little more verbose
  setNames(c("food_regime", "initial_volume", "width")) %>% 
  # Factors are very helpful for modeling, so we convert one column
  mutate(food_regime = factor(food_regime, levels = c("Initial", "Low", "High")))
## Parsed with column specification:
## cols(
##   TREAT = col_character(),
##   IV = col_double(),
##   SUTW = col_double()
## )
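
If you want to suppress the column-specification message, the spec shown above can be passed explicitly through col_types (an optional sketch that re-reads the file with the original column names):

read_csv("https://tidymodels.org/start/models/urchins.csv",
         col_types = cols(TREAT = col_character(),
                          IV    = col_double(),
                          SUTW  = col_double()))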

Check the dataset:

urchins
## # A tibble: 72 x 3
##    food_regime initial_volume width
##    <fct>                <dbl> <dbl>
##  1 Initial                3.5 0.01 
##  2 Initial                5   0.02 
##  3 Initial                8   0.061
##  4 Initial               10   0.051
##  5 Initial               13   0.041
##  6 Initial               13   0.061
##  7 Initial               15   0.041
##  8 Initial               15   0.071
##  9 Initial               16   0.092
## 10 Initial               17   0.051
## # ... with 62 more rows
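
Before plotting, it can be helpful to confirm how many urchins fall in each feeding regime (an optional check; output omitted here):

urchins %>% 
  count(food_regime)   # number of observations per feeding regime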

As a first step in modeling, it’s always a good idea to plot the data:

ggplot(urchins,
       aes(x = initial_volume, 
           y = width, 
           group = food_regime, 
           col = food_regime)) + 
  geom_point() + 
  geom_smooth(method = lm, se = FALSE) +
  scale_color_viridis_d(option = "plasma", end = .7)
## `geom_smooth()` using formula 'y ~ x'

We can see that urchins that were larger in volume at the start of the experiment tended to have wider sutures at the end, but the slopes of the lines look different, so this effect may depend on the feeding regime.

Because we have one continuous predictor (initial volume) and one categorical predictor (food regime), a two-way ANOVA-style model with an interaction makes sense, and we will fit it with linear regression.
With parsnip, we first specify the functional form of the model and then set the computational engine used for model fitting and training.

lm_mod <- 
  linear_reg() %>% 
  set_engine("lm")

lm_mod
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
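
To see how parsnip will map this specification onto the underlying engine call, the optional “translate()” helper can be used (output omitted here):

lm_mod %>% 
  translate()   # shows the template of the stats::lm() call that the "lm" engine will build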

From here, the model can be estimated or trained using the “fit()” function

lm_fit <- 
  lm_mod %>% 
  fit(width ~ initial_volume * food_regime, data = urchins)

We use “tidy()” to provide the summary results in a more predictable and useful format:

tidy(lm_fit)
## # A tibble: 6 x 5
##   term                            estimate std.error statistic  p.value
##   <chr>                              <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)                     0.0331    0.00962      3.44  0.00100 
## 2 initial_volume                  0.00155   0.000398     3.91  0.000222
## 3 food_regimeLow                  0.0198    0.0130       1.52  0.133   
## 4 food_regimeHigh                 0.0214    0.0145       1.47  0.145   
## 5 initial_volume:food_regimeLow  -0.00126   0.000510    -2.47  0.0162  
## 6 initial_volume:food_regimeHigh  0.000525  0.000702     0.748 0.457

The fitted “lm_fit” object has the underlying linear regression model built in; it can be accessed as lm_fit$fit.
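
If needed, that underlying lm object can be inspected directly with base R (an optional check; output omitted here):

summary(lm_fit$fit)   # standard summary() of the stats::lm fit stored by parsnip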

Suppose that, for a publication, it would be particularly interesting to plot the mean body width for urchins that started the experiment with an initial volume of 20 ml.
In other words, we will predict body width for each food regime at the same initial volume of 20 ml.
To create such a graph, we start with some new example data that we will make predictions for, to show in our graph:

new_points <- expand.grid(initial_volume = 20, 
                          food_regime = c("Initial", "Low", "High"))
new_points
##   initial_volume food_regime
## 1             20     Initial
## 2             20         Low
## 3             20        High

To get our predicted results, we can use the “predict()” function to find the mean values at 20 ml. First, let’s generate the mean body width values:

mean_pred <- predict(lm_fit, new_data = new_points)
mean_pred
## # A tibble: 3 x 1
##    .pred
##    <dbl>
## 1 0.0642
## 2 0.0588
## 3 0.0961

Second, let’s generate the confidence intervals for the body width values:

conf_int_pred <- predict(lm_fit, 
                         new_data = new_points, 
                         type = "conf_int")
conf_int_pred
## # A tibble: 3 x 2
##   .pred_lower .pred_upper
##         <dbl>       <dbl>
## 1      0.0555      0.0729
## 2      0.0499      0.0678
## 3      0.0870      0.105

Now combine:

plot_data <- 
  new_points %>% 
  bind_cols(mean_pred) %>% 
  bind_cols(conf_int_pred)

and plot:

ggplot(plot_data, aes(x = food_regime)) + 
  geom_point(aes(y = .pred)) + 
  geom_errorbar(aes(ymin = .pred_lower, 
                    ymax = .pred_upper),
                width = .2) + 
  labs(y = "urchin size")

We will now try Bayesian modeling to compare our results. We are interested in knowing whether the results would be different if the model were estimated using a Bayesian approach.
A prior distribution needs to be declared for each model parameter.
Set the prior distribution:

prior_dist <- rstanarm::student_t(df = 1)

We can use set.seed() so that the same (pseudo-)random numbers are generated each time this Bayesian model is fit, making the analysis reproducible.

set.seed(123) 
bayes_mod <-   
  linear_reg() %>% 
  set_engine("stan", 
             prior_intercept = prior_dist, 
             prior = prior_dist) 

Train the model:

bayes_fit <- 
  bayes_mod %>% 
  fit(width ~ initial_volume * food_regime, data = urchins)

print(bayes_fit, digits = 5)
## parsnip model object
## 
## Fit time:  3.5s 
## stan_glm
##  family:       gaussian [identity]
##  formula:      width ~ initial_volume * food_regime
##  observations: 72
##  predictors:   6
## ------
##                                Median   MAD_SD  
## (Intercept)                     0.03472  0.00922
## initial_volume                  0.00150  0.00039
## food_regimeLow                  0.01775  0.01293
## food_regimeHigh                 0.01907  0.01379
## initial_volume:food_regimeLow  -0.00119  0.00051
## initial_volume:food_regimeHigh  0.00061  0.00066
## 
## Auxiliary parameter(s):
##       Median  MAD_SD 
## sigma 0.02132 0.00181
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
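
As the printed output suggests, the priors that were actually used can be inspected by calling prior_summary() on the underlying stanreg object stored by parsnip (output omitted here):

rstanarm::prior_summary(bayes_fit$fit)   # priors used for the intercept, coefficients, and auxiliary parameter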

To update the parameter table, the tidy() method is once again used:

tidy(bayes_fit, conf.int = TRUE) #interfaces to common tasks are standardized.
## # A tibble: 6 x 5
##   term                            estimate std.error  conf.low conf.high
##   <chr>                              <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)                     0.0347    0.00922   0.0188    0.0499  
## 2 initial_volume                  0.00150   0.000390  0.000858  0.00216 
## 3 food_regimeLow                  0.0178    0.0129   -0.00370   0.0378  
## 4 food_regimeHigh                 0.0191    0.0138   -0.00288   0.0426  
## 5 initial_volume:food_regimeLow  -0.00119   0.000511 -0.00199  -0.000356
## 6 initial_volume:food_regimeHigh  0.000608  0.000660 -0.000516  0.00171

Prepare data to make the graph.

bayes_plot_data <- 
  new_points %>% 
  bind_cols(predict(bayes_fit, new_data = new_points)) %>% 
  bind_cols(predict(bayes_fit, new_data = new_points, type = "conf_int"))

Now plot!

ggplot(bayes_plot_data, aes(x = food_regime)) + 
  geom_point(aes(y = .pred)) + 
  geom_errorbar(aes(ymin = .pred_lower, ymax = .pred_upper), width = .2) + 
  labs(y = "urchin size") + 
  ggtitle("Bayesian model with t(1) prior distribution")

This isn’t very different from the non-Bayesian results (except in interpretation).
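
For a quick numeric comparison, the point predictions from the two fits can be placed side by side (a small optional sketch using the objects created above; output omitted here):

new_points %>% 
  mutate(pred_lm   = predict(lm_fit,    new_data = new_points)$.pred,   # least-squares predictions
         pred_stan = predict(bayes_fit, new_data = new_points)$.pred)   # Bayesian (stan) predictions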

The tidymodels framework allows us to build a model step by step rather than jumping from start to finish in one go.
When manipulating data, the tidyverse pipe (%>%) works well because all of the functions take the data as the first argument. For example,

urchins %>% 
  group_by(food_regime) %>% 
  summarize(med_vol = median(initial_volume))
## # A tibble: 3 x 2
##   food_regime med_vol
##   <fct>         <dbl>
## 1 Initial        20.5
## 2 Low            19.2
## 3 High           15

whereas the modeling code uses the pipe to pass around the model object:

bayes_mod %>% 
  fit(width ~ initial_volume * food_regime, data = urchins)
## parsnip model object
## 
## Fit time:  4.1s 
## stan_glm
##  family:       gaussian [identity]
##  formula:      width ~ initial_volume * food_regime
##  observations: 72
##  predictors:   6
## ------
##                                Median MAD_SD
## (Intercept)                    0.0    0.0   
## initial_volume                 0.0    0.0   
## food_regimeLow                 0.0    0.0   
## food_regimeHigh                0.0    0.0   
## initial_volume:food_regimeLow  0.0    0.0   
## initial_volume:food_regimeHigh 0.0    0.0   
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 0.0    0.0   
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

This is very similar to how ggplot2 operates,

ggplot(urchins,
       aes(initial_volume, width)) +      # returns a ggplot object 
  geom_jitter() +                         # same
  geom_smooth(method = lm, se = FALSE) +  # same                    
  labs(x = "Volume", y = "Width")         # etc
## `geom_smooth()` using formula 'y ~ x'

Reference: tidymodels “Build a Model” tutorial (https://www.tidymodels.org/start/models/)