An Introduction to Statistical Learning (2nd ed.)

Labs

Chapter 05 resampling Methods

library(tidymodels)
library(ISLR)

theme_set(theme_bw())

auto <- tibble(Auto)
portfolio <- tibble(Portfolio)

glimpse(auto)

Rows: 392
Columns: 9
$ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2~
$ cylinders    <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, ~
$ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34~
$ horsepower   <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 16~
$ weight       <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385~
$ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, ~
$ year         <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7~
$ origin       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3, ~
$ name         <fct> chevrolet chevelle malibu, buick skylark 320, plymouth sa~

glimpse(portfolio)

Rows: 100
Columns: 2
$ X <dbl> -0.89525089, -1.56245433, -0.41708988, 1.04435573, -0.31556841, -1.7~
$ Y <dbl> -0.2349235, -0.8851760, 0.2718880, -0.7341975, 0.8419834, -2.0371910~

When fitting a model it is often desired to be able to calculate a performance metric to quantify how well the model fits the data. If a model is evaluated on the data it was fit on you are quite likely to get over-optimistic results. It is therefore we split our data into testing and training. This way we can fit the model to data and evaluate it on some other that that is similar.

Splitting of the data is done using random sampling, so it is advised to set a seed before splitting to assure we can reproduce the results. The inintial_split() function takes a data.frame and returns a rsplit object. This object contains information about which observations belong to which data set, testing, and training. This is where you would normally set a proportion of data that is used for training and how much is used for evaluation. This is set using the prop argument which I set to 0.5 to closely match what happened in ISLR. I’m also setting the strata argument. This argument makes sure that both sides of the split have roughly the same distribution for each value of strata. If a numeric variable is passed to strata then it is binned and distributions are matched within bins.

set.seed(1)
auto_split <- initial_split(Auto, prop = .5, strata = mpg)

auto_train <- training(auto_split)
auto_test <- testing(auto_split)

Now that we have a train-test split let us fit some models and evaluate their performance. Before we move on it is important to reiterate that you should only use the testing data set once! Once you have looked at the performance on the testing data set you should not modify your models. If you do you might overfit the model due to data leakage.

Our modeling goal is to predict mpg by horsepower using a simple linear regression model, and a polynomial regression model. First, we set up a linear regression specification.

lm_model <- 
  linear_reg() %>% 
  set_engine("lm")

lm_fit <- 
  fit(
    lm_model, 
    mpg ~ horsepower,
    data = auto_train
  )

lm_fit

parsnip model object

Fit time:  21ms 

Call:
stats::lm(formula = mpg ~ horsepower, data = data)

Coefficients:
(Intercept)   horsepower  
    39.5424      -0.1567

lm_fit %>% pluck("fit") %>% summary()


Call:
stats::lm(formula = mpg ~ horsepower, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.2574  -3.0897  -0.0029   2.6124  13.8695 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 39.542424   1.035803   38.18   <2e-16 ***
horsepower  -0.156736   0.009431  -16.62   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.76 on 192 degrees of freedom
Multiple R-squared:  0.5899,    Adjusted R-squared:  0.5878 
F-statistic: 276.2 on 1 and 192 DF,  p-value: < 2.2e-16

pred <- augment(
  lm_fit,
  auto_test
  )

pred %>% 
  rmse(truth = mpg, estimate = .pred)

# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        5.06

pred %>% 
  rsq(truth = mpg, estimate = .pred)

# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rsq     standard       0.621

poly_rec <- 
  recipe(mpg ~horsepower, data = auto_train) %>% 
  step_poly(horsepower, degree = 2)

poly_wf <- 
  workflow() %>% 
  add_recipe(poly_rec) %>% 
  add_model(lm_model)

poly_fit <- 
  fit(
    poly_wf,
    data = auto_train
  )

poly_fit

== Workflow [trained] ==========================================================
Preprocessor: Recipe
Model: linear_reg()

-- Preprocessor ----------------------------------------------------------------
1 Recipe Step

* step_poly()

-- Model -----------------------------------------------------------------------

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
      (Intercept)  horsepower_poly_1  horsepower_poly_2  
            23.29             -79.11              25.10

pred_poly <- augment(poly_fit, auto_test)

rmse(pred_poly,
     truth = mpg,
     estimate = .pred
     )

# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        4.37

rsq(pred_poly,
     truth = mpg,
     estimate = .pred
     )

# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rsq     standard       0.722

Cross validation

set.seed(2021)
auto_cv <- vfold_cv(auto_train, v = 10)

tuned_poly_rec <- 
  recipe(mpg ~horsepower, data = auto_train) %>% 
  step_poly(horsepower, degree = tune())


tuned_poly_rec %>% parameters

Collection of 1 parameters for tuning

 identifier   type    object
     degree degree nparam[+]

tuned_poly_rec %>% pull_dials_object("degree")

Polynomial Degree (quantitative)
Range: [1, 3]

tuned_poly_wf <- 
  workflow() %>% 
  add_recipe(tuned_poly_rec) %>% 
  add_model(lm_model)

poly_grid <- grid_regular(degree(c(1,10)), levels = 10)

#poly_grid_tbl <- crossing(degree = 1:10)

ctrl <- control_grid(save_pred = T, save_workflow = T)

poly_fit <- 
  tune_grid(
    tuned_poly_wf,
    resamples = auto_cv,
    grid = poly_grid,
    metrics = metric_set(rmse, rsq, mae),
    control = ctrl
  )

poly_fit %>% collect_metrics()

# A tibble: 30 x 7
   degree .metric .estimator  mean     n std_err .config              
    <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
 1      1 mae     standard   3.65     10  0.241  Preprocessor01_Model1
 2      1 rmse    standard   4.68     10  0.322  Preprocessor01_Model1
 3      1 rsq     standard   0.605    10  0.0468 Preprocessor01_Model1
 4      2 mae     standard   3.23     10  0.230  Preprocessor02_Model1
 5      2 rmse    standard   4.33     10  0.357  Preprocessor02_Model1
 6      2 rsq     standard   0.662    10  0.0559 Preprocessor02_Model1
 7      3 mae     standard   3.25     10  0.227  Preprocessor03_Model1
 8      3 rmse    standard   4.36     10  0.361  Preprocessor03_Model1
 9      3 rsq     standard   0.658    10  0.0562 Preprocessor03_Model1
10      4 mae     standard   3.28     10  0.227  Preprocessor04_Model1
# ... with 20 more rows

autoplot(poly_fit) + scale_x_continuous(breaks = 1:10)

poly_fit %>% show_best(metric = "rmse")

# A tibble: 5 x 7
  degree .metric .estimator  mean     n std_err .config              
   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1      2 rmse    standard    4.33    10   0.357 Preprocessor02_Model1
2      3 rmse    standard    4.36    10   0.361 Preprocessor03_Model1
3      4 rmse    standard    4.41    10   0.362 Preprocessor04_Model1
4      5 rmse    standard    4.44    10   0.369 Preprocessor05_Model1
5      7 rmse    standard    4.45    10   0.367 Preprocessor07_Model1

best_tune <- poly_fit %>% select_best(metric = "rmse")

poly_fit %>% select_by_one_std_err(degree, metric = "rmse")

# A tibble: 1 x 9
  degree .metric .estimator  mean     n std_err .config             .best .bound
   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               <dbl>  <dbl>
1      1 rmse    standard    4.68    10   0.322 Preprocessor01_Mod~  4.33   4.69

poly_final_wf <- tuned_poly_wf %>% 
  finalize_workflow(best_tune)

final_poly_fit <- 
  last_fit(poly_final_wf,
           split = auto_split)

final_poly_fit %>% collect_metrics()

# A tibble: 2 x 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       4.37  Preprocessor1_Model1
2 rsq     standard       0.722 Preprocessor1_Model1

final_poly_fit %>% collect_predictions() %>% 
  ggplot(aes(x = .pred, y = mpg)) + 
  geom_point() + 
  geom_smooth(method = "lm", fill = "lightblue")

An Introduction to Statistcial Learning

ISLR tidymodels Labs

– END

sessionInfo()

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=Spanish_Mexico.1252  LC_CTYPE=Spanish_Mexico.1252   
[3] LC_MONETARY=Spanish_Mexico.1252 LC_NUMERIC=C                   
[5] LC_TIME=Spanish_Mexico.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ISLR_1.2           yardstick_0.0.8    workflowsets_0.0.2 workflows_0.2.3   
 [5] tune_0.1.6         tidyr_1.1.3        tibble_3.1.2       rsample_0.1.0     
 [9] recipes_0.1.16     purrr_0.3.4        parsnip_0.1.7      modeldata_0.1.1   
[13] infer_0.5.4        ggplot2_3.3.5      dplyr_1.0.7        dials_0.0.9       
[17] scales_1.1.1       broom_0.7.8        tidymodels_0.1.3  

loaded via a namespace (and not attached):
 [1] sass_0.4.0         jsonlite_1.7.2     splines_4.1.0      foreach_1.5.1     
 [5] prodlim_2019.11.13 bslib_0.2.5.1      assertthat_0.2.1   highr_0.9         
 [9] GPfit_1.0-8        yaml_2.2.1         globals_0.14.0     ipred_0.9-11      
[13] pillar_1.6.1       backports_1.2.1    lattice_0.20-44    glue_1.4.2        
[17] pROC_1.17.0.1      digest_0.6.27      hardhat_0.1.6      colorspace_2.0-2  
[21] plyr_1.8.6         htmltools_0.5.1.1  Matrix_1.3-3       timeDate_3043.102 
[25] pkgconfig_2.0.3    lhs_1.1.1          DiceDesign_1.9     listenv_0.8.0     
[29] gower_0.2.2        lava_1.6.9         mgcv_1.8-35        farver_2.1.0      
[33] generics_0.1.0     ellipsis_0.3.2     withr_2.4.2        furrr_0.2.3       
[37] nnet_7.3-16        cli_3.0.0          survival_3.2-11    magrittr_2.0.1    
[41] crayon_1.4.1       evaluate_0.14      future_1.21.0      fansi_0.5.0       
[45] parallelly_1.26.1  nlme_3.1-152       MASS_7.3-54        class_7.3-19      
[49] prettyunits_1.1.1  tools_4.1.0        lifecycle_1.0.0    stringr_1.4.0     
[53] munsell_0.5.0      compiler_4.1.0     jquerylib_0.1.4    rlang_0.4.11      
[57] grid_4.1.0         rstudioapi_0.13    iterators_1.0.13   labeling_0.4.2    
[61] rmarkdown_2.9      gtable_0.3.0       codetools_0.2-18   DBI_1.1.1         
[65] R6_2.5.0           lubridate_1.7.10   knitr_1.33         utf8_1.2.1        
[69] stringi_1.6.2      parallel_4.1.0     Rcpp_1.0.7         vctrs_0.3.8       
[73] rpart_4.1-15       tidyselect_1.1.1   xfun_0.24

05. Resampling Methods

GS

15/8/2021

An Introduction to Statistical Learning (2nd ed.)

Labs

Chapter 05

resampling Methods

Cross validation