BOOTSTRAP RESAMPLING WITH TIDYTUESDAY BEER PRODUCTION DATA

I’ve been publishing screencasts demonstrating how to use the tidymodels framwork, from first steps in modeling to how to tune more complex models. Today, I’m using this week;s TidyTuesday dataset on beer production to show how to use bootstrap resampling t oestimate model parameters

Explore the data

Our modeling goal here is to estiamte how much sugar beer producers use relative to malt according to the TidyTuesday dataset. We’ll use bootstrap resampling to do this!

First, let’s look at the data on brewing materials

library(tidyverse)

## -- Attaching packages -------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.0
## v tidyr   1.1.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts ----------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(tidymodels)

## -- Attaching packages ------------------- tidymodels 0.1.1 --

## v broom     0.7.0      v recipes   0.1.13
## v dials     0.0.8      v rsample   0.0.7 
## v infer     0.5.3      v tune      0.1.1 
## v modeldata 0.0.2      v workflows 0.1.2 
## v parsnip   0.1.2      v yardstick 0.0.7

## -- Conflicts ---------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()

library(ggthemes)

theme_set(theme_light())

brewing_materials_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-31/brewing_materials.csv")

## Parsed with column specification:
## cols(
##   data_type = col_character(),
##   material_type = col_character(),
##   year = col_double(),
##   month = col_double(),
##   type = col_character(),
##   month_current = col_double(),
##   month_prior_year = col_double(),
##   ytd_current = col_double(),
##   ytd_prior_year = col_double()
## )

brewing_materials_raw %>% 
               count(type, wt = month_current, sort = TRUE)

## # A tibble: 12 x 2
##    type                                 n
##    <chr>                            <dbl>
##  1 Total Used                 53559516695
##  2 Total Grain products       44734903124
##  3 Malt and malt products     32697313882
##  4 Total Non-Grain products    8824613571
##  5 Sugar and syrups            6653104081
##  6 Rice and rice products      5685742541
##  7 Corn and corn products      5207759409
##  8 Hops (dry)                  1138840132
##  9 Other                        998968470
## 10 Barley and barley products   941444745
## 11 Wheat and wheat products     202642547
## 12 Hops (used as extracts)       33700888

How have some different brewing materials changed over time?

brewing_filtered <- brewing_materials_raw %>% 
               filter(
                              type %in% c(
                                             'Malt and malt products',
                                             'Sugar and syrups',
                                             'Hops (dry)'
                              ),
                              year < 2016,
                              !(month == 12 & year %in% 2014:2015)
                              ) %>% 
               mutate(
                              date = paste0(year, '-', month, '-01'),
                              date = lubridate::ymd(date)
               )
               

brewing_filtered %>% 
               ggplot(aes(date, month_current, color = type)) + geom_point()

There are strong annual patterns in these materials. We want to measure how much sugar beer producers use relative to malt

brewing_materials <- brewing_filtered %>% 
               select(date, type, month_current) %>% 
               pivot_wider(
                              names_from = type,
                              values_from = month_current) %>% 
               janitor::clean_names()
   
brewing_materials

## # A tibble: 94 x 4
##    date       malt_and_malt_products sugar_and_syrups hops_dry
##    <date>                      <dbl>            <dbl>    <dbl>
##  1 2008-01-01              374165152         78358212  4506546
##  2 2008-02-01              355687578         80188744  1815271
##  3 2008-03-01              399855819         78907213  6067167
##  4 2008-04-01              388639443         81199989  6864440
##  5 2008-05-01              411307544         89946309  7470130
##  6 2008-06-01              415161326         81012422  7361941
##  7 2008-07-01              405393784         76728131  1759452
##  8 2008-08-01              389391266         83928121  5992025
##  9 2008-09-01              362587470         71982604  3788942
## 10 2008-10-01              353803777         42828943  3788949
## # ... with 84 more rows

brewing_materials %>% 
               ggplot(aes( x= malt_and_malt_products, y = sugar_and_syrups)) + geom_smooth(method = 'lm') + geom_point()

## `geom_smooth()` using formula 'y ~ x'

There is a lot of variation in this relationship, but beer reproducers use more sugar when they use more malt. What is the relationship?

library(tidymodels)

beer_fit <- lm(data = brewing_materials, sugar_and_syrups ~ 0 + + malt_and_malt_products)

summary(beer_fit)

## 
## Call:
## lm(formula = sugar_and_syrups ~ 0 + +malt_and_malt_products, 
##     data = brewing_materials)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -29985291  -6468052    174001   7364462  23462837 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## malt_and_malt_products 0.205804   0.003446   59.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11480000 on 93 degrees of freedom
## Multiple R-squared:  0.9746, Adjusted R-squared:  0.9743 
## F-statistic:  3567 on 1 and 93 DF,  p-value: < 2.2e-16

tidy(beer_fit)

## # A tibble: 1 x 5
##   term                   estimate std.error statistic  p.value
##   <chr>                     <dbl>     <dbl>     <dbl>    <dbl>
## 1 malt_and_malt_products    0.206   0.00345      59.7 5.72e-76

Here I am choosing to set the intercept to zero to take a simplified view of the malt-sugar relationship (i.e beer producers don’t use any sugar if they aren’t starting with malt). We could leave that off and estimate bot han intercept (baseline use of sugar all the time) and slope (increase in use of sugar per barrel of malt).

This model and the visualization above are based on model assumptions that may not hold with our real world beer production data. Bootstrap resampling provides predictions and confidence intervals that are more robust

Bootstrap resampling

First, let’s create a set of bootstrap resamples

set.seed(123)
beer_boot <- bootstraps(brewing_materials, times = 1e3, apparent = TRUE)

beer_boot

## # Bootstrap sampling with apparent sample 
## # A tibble: 1,001 x 2
##    splits          id           
##    <list>          <chr>        
##  1 <split [94/39]> Bootstrap0001
##  2 <split [94/34]> Bootstrap0002
##  3 <split [94/37]> Bootstrap0003
##  4 <split [94/39]> Bootstrap0004
##  5 <split [94/29]> Bootstrap0005
##  6 <split [94/27]> Bootstrap0006
##  7 <split [94/35]> Bootstrap0007
##  8 <split [94/33]> Bootstrap0008
##  9 <split [94/29]> Bootstrap0009
## 10 <split [94/34]> Bootstrap0010
## # ... with 991 more rows

Next, let’s train a model to each of these bootstrap samples. We can use tidy() with map() to create a dataframe of model results

beer_models <- beer_boot %>% 
               mutate(
                              model = map(splits, ~ lm(sugar_and_syrups ~ 0 + malt_and_malt_products, data = .)),
                              coef_info = map(model, tidy)
               )
               
beer_coefs <- beer_models %>% 
               unnest(coef_info)

beer_coefs

## # A tibble: 1,001 x 8
##    splits     id        model term         estimate std.error statistic  p.value
##    <list>     <chr>     <lis> <chr>           <dbl>     <dbl>     <dbl>    <dbl>
##  1 <split [9~ Bootstra~ <lm>  malt_and_ma~    0.203   0.00326      62.3 1.31e-77
##  2 <split [9~ Bootstra~ <lm>  malt_and_ma~    0.208   0.00338      61.7 3.17e-77
##  3 <split [9~ Bootstra~ <lm>  malt_and_ma~    0.205   0.00336      61.1 7.30e-77
##  4 <split [9~ Bootstra~ <lm>  malt_and_ma~    0.206   0.00361      57.1 3.26e-74
##  5 <split [9~ Bootstra~ <lm>  malt_and_ma~    0.203   0.00349      58.3 4.77e-75
##  6 <split [9~ Bootstra~ <lm>  malt_and_ma~    0.209   0.00335      62.2 1.33e-77
##  7 <split [9~ Bootstra~ <lm>  malt_and_ma~    0.210   0.00330      63.7 1.73e-78
##  8 <split [9~ Bootstra~ <lm>  malt_and_ma~    0.209   0.00359      58.2 5.52e-75
##  9 <split [9~ Bootstra~ <lm>  malt_and_ma~    0.207   0.00342      60.5 1.74e-76
## 10 <split [9~ Bootstra~ <lm>  malt_and_ma~    0.207   0.00378      54.9 1.14e-72
## # ... with 991 more rows

Evaluate results

What is the distribution of relationship between sugar and malt?

beer_coefs %>% 
               ggplot(aes(estimate)) +
               geom_histogram(alpha = 0.7, fill ='cyan3')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can see where this distribution is centered and how broad it is from this visualization, and we can estimate these quantites using int_pctl() from the rsample package

int_pctl(beer_models, coef_info)

## # A tibble: 1 x 6
##   term                   .lower .estimate .upper .alpha .method   
##   <chr>                   <dbl>     <dbl>  <dbl>  <dbl> <chr>     
## 1 malt_and_malt_products  0.199     0.206  0.212   0.05 percentile

We can also visualize some of these fits to the bootstrap resamples. First, let’s use augment() to get the fitted values for each resampled data point

beer_aug <- beer_models %>% 
               sample_n(200) %>% 
               mutate(augmented = map (model, augment)) %>% 
               unnest(augmented)

beer_aug

## # A tibble: 18,800 x 12
##    splits id    model coef_info sugar_and_syrups malt_and_malt_p~ .fitted
##    <list> <chr> <lis> <list>               <dbl>            <dbl>   <dbl>
##  1 <spli~ Boot~ <lm>  <tibble ~         71341108        384396702  7.78e7
##  2 <spli~ Boot~ <lm>  <tibble ~         76728131        405393784  8.20e7
##  3 <spli~ Boot~ <lm>  <tibble ~         73793509        322480722  6.53e7
##  4 <spli~ Boot~ <lm>  <tibble ~         85703037        340319408  6.89e7
##  5 <spli~ Boot~ <lm>  <tibble ~         67266337        380521275  7.70e7
##  6 <spli~ Boot~ <lm>  <tibble ~         81199989        388639443  7.86e7
##  7 <spli~ Boot~ <lm>  <tibble ~         76115769        399504457  8.08e7
##  8 <spli~ Boot~ <lm>  <tibble ~         66002563        321371392  6.50e7
##  9 <spli~ Boot~ <lm>  <tibble ~         85703037        340319408  6.89e7
## 10 <spli~ Boot~ <lm>  <tibble ~         74805384        351222725  7.11e7
## # ... with 18,790 more rows, and 5 more variables: .resid <dbl>,
## #   .std.resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd <dbl>

Then let’s create visualization

ggplot(beer_aug, aes(malt_and_malt_products, sugar_and_syrups)) + geom_line(aes( y= .fitted, group = id), alpha =.2, col ='cyan3') + geom_point()

BOOTSTRAP RESAMPLING WITH TIDYTUESDAY BEER PRODUCTION DATA

Nguyen_LSCM

7/31/2020

Explore the data

Bootstrap resampling

Evaluate results