library(tidymodels)
library(ISLR)
theme_set(theme_bw())
auto <- tibble(Auto)
portfolio <- tibble(Portfolio)
glimpse(auto)
Rows: 392
Columns: 9
$ mpg <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2~
$ cylinders <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, ~
$ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34~
$ horsepower <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 16~
$ weight <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385~
$ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, ~
$ year <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7~
$ origin <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3, ~
$ name <fct> chevrolet chevelle malibu, buick skylark 320, plymouth sa~
glimpse(portfolio)
Rows: 100
Columns: 2
$ X <dbl> -0.89525089, -1.56245433, -0.41708988, 1.04435573, -0.31556841, -1.7~
$ Y <dbl> -0.2349235, -0.8851760, 0.2718880, -0.7341975, 0.8419834, -2.0371910~
When fitting a model it is often desired to be able to calculate a performance metric to quantify how well the model fits the data. If a model is evaluated on the data it was fit on you are quite likely to get over-optimistic results. It is therefore we split our data into testing and training. This way we can fit the model to data and evaluate it on some other that that is similar.
Splitting of the data is done using random sampling, so it is advised to set a seed before splitting to assure we can reproduce the results. The inintial_split() function takes a data.frame and returns a rsplit object. This object contains information about which observations belong to which data set, testing, and training. This is where you would normally set a proportion of data that is used for training and how much is used for evaluation. This is set using the prop argument which I set to 0.5 to closely match what happened in ISLR. I’m also setting the strata argument. This argument makes sure that both sides of the split have roughly the same distribution for each value of strata. If a numeric variable is passed to strata then it is binned and distributions are matched within bins.
set.seed(1)
auto_split <- initial_split(Auto, prop = .5, strata = mpg)
auto_train <- training(auto_split)
auto_test <- testing(auto_split)
Now that we have a train-test split let us fit some models and evaluate their performance. Before we move on it is important to reiterate that you should only use the testing data set once! Once you have looked at the performance on the testing data set you should not modify your models. If you do you might overfit the model due to data leakage.
Our modeling goal is to predict mpg by horsepower using a simple linear regression model, and a polynomial regression model. First, we set up a linear regression specification.
lm_model <-
linear_reg() %>%
set_engine("lm")
lm_fit <-
fit(
lm_model,
mpg ~ horsepower,
data = auto_train
)
lm_fit
parsnip model object
Fit time: 21ms
Call:
stats::lm(formula = mpg ~ horsepower, data = data)
Coefficients:
(Intercept) horsepower
39.5424 -0.1567
lm_fit %>% pluck("fit") %>% summary()
Call:
stats::lm(formula = mpg ~ horsepower, data = data)
Residuals:
Min 1Q Median 3Q Max
-13.2574 -3.0897 -0.0029 2.6124 13.8695
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.542424 1.035803 38.18 <2e-16 ***
horsepower -0.156736 0.009431 -16.62 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.76 on 192 degrees of freedom
Multiple R-squared: 0.5899, Adjusted R-squared: 0.5878
F-statistic: 276.2 on 1 and 192 DF, p-value: < 2.2e-16
pred <- augment(
lm_fit,
auto_test
)
pred %>%
rmse(truth = mpg, estimate = .pred)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 5.06
pred %>%
rsq(truth = mpg, estimate = .pred)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rsq standard 0.621
poly_rec <-
recipe(mpg ~horsepower, data = auto_train) %>%
step_poly(horsepower, degree = 2)
poly_wf <-
workflow() %>%
add_recipe(poly_rec) %>%
add_model(lm_model)
poly_fit <-
fit(
poly_wf,
data = auto_train
)
poly_fit
== Workflow [trained] ==========================================================
Preprocessor: Recipe
Model: linear_reg()
-- Preprocessor ----------------------------------------------------------------
1 Recipe Step
* step_poly()
-- Model -----------------------------------------------------------------------
Call:
stats::lm(formula = ..y ~ ., data = data)
Coefficients:
(Intercept) horsepower_poly_1 horsepower_poly_2
23.29 -79.11 25.10
pred_poly <- augment(poly_fit, auto_test)
rmse(pred_poly,
truth = mpg,
estimate = .pred
)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 4.37
rsq(pred_poly,
truth = mpg,
estimate = .pred
)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rsq standard 0.722
set.seed(2021)
auto_cv <- vfold_cv(auto_train, v = 10)
tuned_poly_rec <-
recipe(mpg ~horsepower, data = auto_train) %>%
step_poly(horsepower, degree = tune())
tuned_poly_rec %>% parameters
Collection of 1 parameters for tuning
identifier type object
degree degree nparam[+]
tuned_poly_rec %>% pull_dials_object("degree")
Polynomial Degree (quantitative)
Range: [1, 3]
tuned_poly_wf <-
workflow() %>%
add_recipe(tuned_poly_rec) %>%
add_model(lm_model)
poly_grid <- grid_regular(degree(c(1,10)), levels = 10)
#poly_grid_tbl <- crossing(degree = 1:10)
ctrl <- control_grid(save_pred = T, save_workflow = T)
poly_fit <-
tune_grid(
tuned_poly_wf,
resamples = auto_cv,
grid = poly_grid,
metrics = metric_set(rmse, rsq, mae),
control = ctrl
)
poly_fit %>% collect_metrics()
# A tibble: 30 x 7
degree .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 1 mae standard 3.65 10 0.241 Preprocessor01_Model1
2 1 rmse standard 4.68 10 0.322 Preprocessor01_Model1
3 1 rsq standard 0.605 10 0.0468 Preprocessor01_Model1
4 2 mae standard 3.23 10 0.230 Preprocessor02_Model1
5 2 rmse standard 4.33 10 0.357 Preprocessor02_Model1
6 2 rsq standard 0.662 10 0.0559 Preprocessor02_Model1
7 3 mae standard 3.25 10 0.227 Preprocessor03_Model1
8 3 rmse standard 4.36 10 0.361 Preprocessor03_Model1
9 3 rsq standard 0.658 10 0.0562 Preprocessor03_Model1
10 4 mae standard 3.28 10 0.227 Preprocessor04_Model1
# ... with 20 more rows
autoplot(poly_fit) + scale_x_continuous(breaks = 1:10)
poly_fit %>% show_best(metric = "rmse")
# A tibble: 5 x 7
degree .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 2 rmse standard 4.33 10 0.357 Preprocessor02_Model1
2 3 rmse standard 4.36 10 0.361 Preprocessor03_Model1
3 4 rmse standard 4.41 10 0.362 Preprocessor04_Model1
4 5 rmse standard 4.44 10 0.369 Preprocessor05_Model1
5 7 rmse standard 4.45 10 0.367 Preprocessor07_Model1
best_tune <- poly_fit %>% select_best(metric = "rmse")
poly_fit %>% select_by_one_std_err(degree, metric = "rmse")
# A tibble: 1 x 9
degree .metric .estimator mean n std_err .config .best .bound
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr> <dbl> <dbl>
1 1 rmse standard 4.68 10 0.322 Preprocessor01_Mod~ 4.33 4.69
poly_final_wf <- tuned_poly_wf %>%
finalize_workflow(best_tune)
final_poly_fit <-
last_fit(poly_final_wf,
split = auto_split)
final_poly_fit %>% collect_metrics()
# A tibble: 2 x 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 rmse standard 4.37 Preprocessor1_Model1
2 rsq standard 0.722 Preprocessor1_Model1
final_poly_fit %>% collect_predictions() %>%
ggplot(aes(x = .pred, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", fill = "lightblue")
An Introduction to Statistcial Learning
– END
sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=Spanish_Mexico.1252 LC_CTYPE=Spanish_Mexico.1252
[3] LC_MONETARY=Spanish_Mexico.1252 LC_NUMERIC=C
[5] LC_TIME=Spanish_Mexico.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ISLR_1.2 yardstick_0.0.8 workflowsets_0.0.2 workflows_0.2.3
[5] tune_0.1.6 tidyr_1.1.3 tibble_3.1.2 rsample_0.1.0
[9] recipes_0.1.16 purrr_0.3.4 parsnip_0.1.7 modeldata_0.1.1
[13] infer_0.5.4 ggplot2_3.3.5 dplyr_1.0.7 dials_0.0.9
[17] scales_1.1.1 broom_0.7.8 tidymodels_0.1.3
loaded via a namespace (and not attached):
[1] sass_0.4.0 jsonlite_1.7.2 splines_4.1.0 foreach_1.5.1
[5] prodlim_2019.11.13 bslib_0.2.5.1 assertthat_0.2.1 highr_0.9
[9] GPfit_1.0-8 yaml_2.2.1 globals_0.14.0 ipred_0.9-11
[13] pillar_1.6.1 backports_1.2.1 lattice_0.20-44 glue_1.4.2
[17] pROC_1.17.0.1 digest_0.6.27 hardhat_0.1.6 colorspace_2.0-2
[21] plyr_1.8.6 htmltools_0.5.1.1 Matrix_1.3-3 timeDate_3043.102
[25] pkgconfig_2.0.3 lhs_1.1.1 DiceDesign_1.9 listenv_0.8.0
[29] gower_0.2.2 lava_1.6.9 mgcv_1.8-35 farver_2.1.0
[33] generics_0.1.0 ellipsis_0.3.2 withr_2.4.2 furrr_0.2.3
[37] nnet_7.3-16 cli_3.0.0 survival_3.2-11 magrittr_2.0.1
[41] crayon_1.4.1 evaluate_0.14 future_1.21.0 fansi_0.5.0
[45] parallelly_1.26.1 nlme_3.1-152 MASS_7.3-54 class_7.3-19
[49] prettyunits_1.1.1 tools_4.1.0 lifecycle_1.0.0 stringr_1.4.0
[53] munsell_0.5.0 compiler_4.1.0 jquerylib_0.1.4 rlang_0.4.11
[57] grid_4.1.0 rstudioapi_0.13 iterators_1.0.13 labeling_0.4.2
[61] rmarkdown_2.9 gtable_0.3.0 codetools_0.2-18 DBI_1.1.1
[65] R6_2.5.0 lubridate_1.7.10 knitr_1.33 utf8_1.2.1
[69] stringi_1.6.2 parallel_4.1.0 Rcpp_1.0.7 vctrs_0.3.8
[73] rpart_4.1-15 tidyselect_1.1.1 xfun_0.24