setwd("D:/Class Materials & Work/Summer 2020 practice/1_Tidymodel_Model_building")
We will specify and train models
Loading required packages
library(tidymodels) # for the parsnip package, along with the rest of tidymodels
## -- Attaching packages -------------------------------------------------------------------------------- tidymodels 0.1.0 --
## v broom 0.5.6 v recipes 0.1.12
## v dials 0.0.6 v rsample 0.0.6
## v dplyr 0.8.5 v tibble 3.0.1
## v ggplot2 3.3.1 v tune 0.1.0
## v infer 0.5.1 v workflows 0.1.1
## v parsnip 0.1.1 v yardstick 0.0.6
## v purrr 0.3.4
## -- Conflicts ----------------------------------------------------------------------------------- tidymodels_conflicts() --
## x purrr::discard() masks scales::discard()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x ggplot2::margin() masks dials::margin()
## x recipes::step() masks stats::step()
Helper packages
library(readr) # for importing data
##
## Attaching package: 'readr'
## The following object is masked from 'package:yardstick':
##
## spec
## The following object is masked from 'package:scales':
##
## col_factor
library(broom.mixed) # for converting bayesian models to tidy tibbles
## Warning: package 'broom.mixed' was built under R version 4.0.2
## Registered S3 methods overwritten by 'broom.mixed':
## method from
## augment.lme broom
## augment.merMod broom
## glance.lme broom
## glance.merMod broom
## glance.stanreg broom
## tidy.brmsfit broom
## tidy.gamlss broom
## tidy.lme broom
## tidy.merMod broom
## tidy.rjags broom
## tidy.stanfit broom
## tidy.stanreg broom
##
## Attaching package: 'broom.mixed'
## The following object is masked from 'package:broom':
##
## tidyMCMC
Read the sea urchin data to examine how three different feeding regimes affect the size of sea urchins over time.
urchins <-
# Data were assembled for a tutorial
# at https://www.flutterbys.com.au/stats/tut/tut7.5a.html
read_csv("https://tidymodels.org/start/models/urchins.csv") %>%
# Change the names to be a little more verbose
setNames(c("food_regime", "initial_volume", "width")) %>%
# Factors are very helpful for modeling, so we convert one column
mutate(food_regime = factor(food_regime, levels = c("Initial", "Low", "High")))
## Parsed with column specification:
## cols(
## TREAT = col_character(),
## IV = col_double(),
## SUTW = col_double()
## )
check the dataset
urchins
## # A tibble: 72 x 3
## food_regime initial_volume width
## <fct> <dbl> <dbl>
## 1 Initial 3.5 0.01
## 2 Initial 5 0.02
## 3 Initial 8 0.061
## 4 Initial 10 0.051
## 5 Initial 13 0.041
## 6 Initial 13 0.061
## 7 Initial 15 0.041
## 8 Initial 15 0.071
## 9 Initial 16 0.092
## 10 Initial 17 0.051
## # ... with 62 more rows
As a first step in modeling, it’s always a good idea to plot the data:
ggplot(urchins,
aes(x = initial_volume,
y = width,
group = food_regime,
col = food_regime)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
scale_color_viridis_d(option = "plasma", end = .7)
## `geom_smooth()` using formula 'y ~ x'
We can see that urchins that were larger in volume at the start of the experiment tended to have wider sutures at the end, but the slopes of the lines look different so this effect may depend on the feeding regime condition.
We will be using two-way ANOVA to analyze the dataset while using linear regression as the predictive framework.
We will specify the functional form of the mode, as well as specify the engine for model fitting and training.
lm_mod <-
linear_reg() %>%
set_engine("lm")
#> Linear Regression Model Specification (regression)
#>
#> Computational engine: lm
From here, the model can be estimated or trained using the “fit()” function
lm_fit <-
lm_mod %>%
fit(width ~ initial_volume * food_regime, data = urchins)
we use “tidy()” to provides the summary results in a more predictable and useful format.
tidy(lm_fit)
## # A tibble: 6 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.0331 0.00962 3.44 0.00100
## 2 initial_volume 0.00155 0.000398 3.91 0.000222
## 3 food_regimeLow 0.0198 0.0130 1.52 0.133
## 4 food_regimeHigh 0.0214 0.0145 1.47 0.145
## 5 initial_volume:food_regimeLow -0.00126 0.000510 -2.47 0.0162
## 6 initial_volume:food_regimeHigh 0.000525 0.000702 0.748 0.457
The “lm_fit” has linear regression model built-in.
Suppose that, for a publication, it would be particularly interesting to make a plot of the mean body size for urchins that started the experiment with an initial volume of 20ml.
Basically, we are going to try predicting the value of body width with different food regime and same initial value of 20.
To create such a graph, we start with some new example data that we will make predictions for, to show in our graph:
new_points <- expand.grid(initial_volume = 20,
food_regime = c("Initial", "Low", "High"))
new_points
## initial_volume food_regime
## 1 20 Initial
## 2 20 Low
## 3 20 High
To get our predicted results, we can use the “predict()” function to find the mean values at 20ml. First, let’s generate the mean body width values:
mean_pred <- predict(lm_fit, new_data = new_points)
mean_pred
## # A tibble: 3 x 1
## .pred
## <dbl>
## 1 0.0642
## 2 0.0588
## 3 0.0961
Second, let’s generate the confidence interval body width values:
conf_int_pred <- predict(lm_fit,
new_data = new_points,
type = "conf_int")
conf_int_pred
## # A tibble: 3 x 2
## .pred_lower .pred_upper
## <dbl> <dbl>
## 1 0.0555 0.0729
## 2 0.0499 0.0678
## 3 0.0870 0.105
Now combine:
plot_data <-
new_points %>%
bind_cols(mean_pred) %>%
bind_cols(conf_int_pred)
and plot:
ggplot(plot_data, aes(x = food_regime)) +
geom_point(aes(y = .pred)) +
geom_errorbar(aes(ymin = .pred_lower,
ymax = .pred_upper),
width = .2) +
labs(y = "urchin size")
We will try Bayesian modeling to compare our results. We are interested in knowing if the results would be different if the model was estimated using Bayesian approach.
A prior distribution needs to be declared for each model parameter.
set the prior distribution
prior_dist <- rstanarm::student_t(df = 1)
We can use set.seed() to ensure that the same (pseudo-)random numbers are generated each time we make this parsnip model.
set.seed(123)
bayes_mod <-
linear_reg() %>%
set_engine("stan",
prior_intercept = prior_dist,
prior = prior_dist)
train the model
bayes_fit <-
bayes_mod %>%
fit(width ~ initial_volume * food_regime, data = urchins)
print(bayes_fit, digits = 5)
## parsnip model object
##
## Fit time: 3.5s
## stan_glm
## family: gaussian [identity]
## formula: width ~ initial_volume * food_regime
## observations: 72
## predictors: 6
## ------
## Median MAD_SD
## (Intercept) 0.03472 0.00922
## initial_volume 0.00150 0.00039
## food_regimeLow 0.01775 0.01293
## food_regimeHigh 0.01907 0.01379
## initial_volume:food_regimeLow -0.00119 0.00051
## initial_volume:food_regimeHigh 0.00061 0.00066
##
## Auxiliary parameter(s):
## Median MAD_SD
## sigma 0.02132 0.00181
##
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
To update the parameter table, the tidy() method is once again used:
tidy(bayes_fit, conf.int = TRUE) #interfaces to common tasks are standardized.
## # A tibble: 6 x 5
## term estimate std.error conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.0347 0.00922 0.0188 0.0499
## 2 initial_volume 0.00150 0.000390 0.000858 0.00216
## 3 food_regimeLow 0.0178 0.0129 -0.00370 0.0378
## 4 food_regimeHigh 0.0191 0.0138 -0.00288 0.0426
## 5 initial_volume:food_regimeLow -0.00119 0.000511 -0.00199 -0.000356
## 6 initial_volume:food_regimeHigh 0.000608 0.000660 -0.000516 0.00171
Prepare data to make the graph.
bayes_plot_data <-
new_points %>%
bind_cols(predict(bayes_fit, new_data = new_points)) %>%
bind_cols(predict(bayes_fit, new_data = new_points, type = "conf_int"))
Now plot!
ggplot(bayes_plot_data, aes(x = food_regime)) +
geom_point(aes(y = .pred)) +
geom_errorbar(aes(ymin = .pred_lower, ymax = .pred_upper), width = .2) +
labs(y = "urchin size") +
ggtitle("Bayesian model with t(1) prior distribution")
This isn’t very different from the non-Bayesian results (except in interpretation).
Tidymodel framework allows us to tune a model step-by-step rather than jumping from start to finish in one go.
In creating a model, the tidyverse pipe (%>%) works well because all of the functions take the data as the first argument. For example,
urchins %>%
group_by(food_regime) %>%
summarize(med_vol = median(initial_volume))
## # A tibble: 3 x 2
## food_regime med_vol
## <fct> <dbl>
## 1 Initial 20.5
## 2 Low 19.2
## 3 High 15
whereas the modeling code uses the pipe to pass around the model object:
bayes_mod %>%
fit(width ~ initial_volume * food_regime, data = urchins)
## parsnip model object
##
## Fit time: 4.1s
## stan_glm
## family: gaussian [identity]
## formula: width ~ initial_volume * food_regime
## observations: 72
## predictors: 6
## ------
## Median MAD_SD
## (Intercept) 0.0 0.0
## initial_volume 0.0 0.0
## food_regimeLow 0.0 0.0
## food_regimeHigh 0.0 0.0
## initial_volume:food_regimeLow 0.0 0.0
## initial_volume:food_regimeHigh 0.0 0.0
##
## Auxiliary parameter(s):
## Median MAD_SD
## sigma 0.0 0.0
##
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
It is very similar to how ggplot operates,
ggplot(urchins,
aes(initial_volume, width)) + # returns a ggplot object
geom_jitter() + # same
geom_smooth(method = lm, se = FALSE) + # same
labs(x = "Volume", y = "Width") # etc
## `geom_smooth()` using formula 'y ~ x'
Reference: Build a Model