Get started with building a model in this R Markdown document that accompanies Build a model tidymodels start article.
If you ever get lost, you can visit the links provided next to section headers to see the accompanying section in the online article.
Take advantage of the RStudio IDE and use “Run All Chunks Above” or “Run Current Chunk” buttons to easily execute code chunks. If you have been running other tidymodels articles in this project, restart R before working on this article so you don’t run out of memory on RStudio Cloud.
Load necessary packages:
library(tidymodels) # for the parsnip package, along with the rest of tidymodels
# Helper packages
library(readr) # for importing data
library(broom.mixed) # for converting bayesian models to tidy tibbles
urchins <-
# Data were assembled for a tutorial
# at https://www.flutterbys.com.au/stats/tut/tut7.5a.html
read_csv("https://tidymodels.org/start/models/urchins.csv") %>%
# Change the names to be a little more verbose
setNames(c("food_regime", "initial_volume", "width")) %>%
# Factors are very helpful for modeling, so we convert one column
mutate(food_regime = factor(food_regime, levels = c("Initial", "Low", "High")))
Look at the data:
urchins
## # A tibble: 72 x 3
## food_regime initial_volume width
## <fct> <dbl> <dbl>
## 1 Initial 3.5 0.01
## 2 Initial 5 0.02
## 3 Initial 8 0.061
## 4 Initial 10 0.051
## 5 Initial 13 0.041
## # … with 67 more rows
Plot the data:
ggplot(urchins,
aes(x = initial_volume,
y = width,
group = food_regime,
col = food_regime)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
scale_color_viridis_d(option = "plasma", end = .7)
linear_reg() %>%
set_engine("lm")
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
Try typing ?linear_reg() in the console to see all
available engines and other details about this model type.
Create model specification:
lm_mod <-
linear_reg() %>%
set_engine("lm")
Fit model:
lm_fit <-
lm_mod %>%
fit(width ~ initial_volume * food_regime, data = urchins)
lm_fit
## parsnip model object
##
## Fit time: 2ms
##
## Call:
## stats::lm(formula = width ~ initial_volume * food_regime, data = data)
##
## Coefficients:
## (Intercept) initial_volume
## 0.0331216 0.0015546
## food_regimeLow food_regimeHigh
## 0.0197824 0.0214111
## initial_volume:food_regimeLow initial_volume:food_regimeHigh
## -0.0012594 0.0005254
Present model results in a tidyverse friendly way with
tidy() from broom package.
tidy(lm_fit)
## # A tibble: 6 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.0331 0.00962 3.44 0.00100
## 2 initial_volume 0.00155 0.000398 3.91 0.000222
## 3 food_regimeLow 0.0198 0.0130 1.52 0.133
## 4 food_regimeHigh 0.0214 0.0145 1.47 0.145
## 5 initial_volume:food_regimeLow -0.00126 0.000510 -2.47 0.0162
## 6 initial_volume:food_regimeHigh 0.000525 0.000702 0.748 0.457
New example data to predict:
new_points <- expand.grid(initial_volume = 20,
food_regime = c("Initial", "Low", "High"))
new_points
## initial_volume food_regime
## 1 20 Initial
## 2 20 Low
## 3 20 High
Generate the mean body width values:
mean_pred <- predict(lm_fit, new_data = new_points)
mean_pred
## # A tibble: 3 x 1
## .pred
## <dbl>
## 1 0.0642
## 2 0.0588
## 3 0.0961
Get confidence intervals and plot:
conf_int_pred <- predict(lm_fit,
new_data = new_points,
type = "conf_int")
conf_int_pred
## # A tibble: 3 x 2
## .pred_lower .pred_upper
## <dbl> <dbl>
## 1 0.0555 0.0729
## 2 0.0499 0.0678
## 3 0.0870 0.105
# Now combine:
plot_data <-
new_points %>%
bind_cols(mean_pred) %>%
bind_cols(conf_int_pred)
# and plot:
ggplot(plot_data, aes(x = food_regime)) +
geom_point(aes(y = .pred)) +
geom_errorbar(aes(ymin = .pred_lower,
ymax = .pred_upper),
width = .2) +
labs(y = "urchin size")
Switch to Bayesian approach by simply changing your engine to stan:
# set the prior distribution
prior_dist <- rstanarm::student_t(df = 1)
set.seed(123)
# make the parsnip model
bayes_mod <-
linear_reg() %>%
set_engine("stan",
prior_intercept = prior_dist,
prior = prior_dist)
# train the model
bayes_fit <-
bayes_mod %>%
fit(width ~ initial_volume * food_regime, data = urchins)
print(bayes_fit, digits = 5)
## parsnip model object
##
## Fit time: 11.1s
## stan_glm
## family: gaussian [identity]
## formula: width ~ initial_volume * food_regime
## observations: 72
## predictors: 6
## ------
## Median MAD_SD
## (Intercept) 0.03286 0.00956
## initial_volume 0.00156 0.00041
## food_regimeLow 0.01968 0.01267
## food_regimeHigh 0.02160 0.01425
## initial_volume:food_regimeLow -0.00125 0.00050
## initial_volume:food_regimeHigh 0.00053 0.00071
##
## Auxiliary parameter(s):
## Median MAD_SD
## sigma 0.02122 0.00185
##
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
To update the parameter table, the tidy() method is once
again used:
tidy(bayes_fit, intervals = TRUE)
## # A tibble: 6 x 3
## term estimate std.error
## <chr> <dbl> <dbl>
## 1 (Intercept) 0.0329 0.00956
## 2 initial_volume 0.00156 0.000408
## 3 food_regimeLow 0.0197 0.0127
## 4 food_regimeHigh 0.0216 0.0143
## 5 initial_volume:food_regimeLow -0.00125 0.000499
## 6 initial_volume:food_regimeHigh 0.000527 0.000707
Get your predictions without changing the syntax you used earlier:
bayes_plot_data <-
new_points %>%
bind_cols(predict(bayes_fit, new_data = new_points)) %>%
bind_cols(predict(bayes_fit, new_data = new_points, type = "conf_int"))
ggplot(bayes_plot_data, aes(x = food_regime)) +
geom_point(aes(y = .pred)) +
geom_errorbar(aes(ymin = .pred_lower, ymax = .pred_upper), width = .2) +
labs(y = "urchin size") +
ggtitle("Bayesian model with t(1) prior distribution")
Think about how we are using the pipe (%>%):