GET STARTED WITH TIDYMODELS AND TIDYTUESDAY PALMER PENGUINS


Lately I’ve been publishing screncasts demonstrating how to use tidymodels framework, from first steps in modeling to how to evaluate complex models. Today’s screencast is good for folks just getting started with tidymodels, using this week’s TidyTueday dataset on penguins

Explore the data

This week’s TidyTuesday dataset is fro palmerpenguins, observations of Antarctic penguins who live on the Palmer Archipelago.

Our modeling goal here is to predict the sex of penguins using a classification model, based on other observations in the dataset

library(tidyverse)
## -- Attaching packages -------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.0
## v tidyr   1.1.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ----------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(palmerpenguins)
library(ggthemes)

theme_set(theme_light())

penguins
## # A tibble: 344 x 8
##    species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
##  1 Adelie  Torge~           39.1          18.7              181        3750
##  2 Adelie  Torge~           39.5          17.4              186        3800
##  3 Adelie  Torge~           40.3          18                195        3250
##  4 Adelie  Torge~           NA            NA                 NA          NA
##  5 Adelie  Torge~           36.7          19.3              193        3450
##  6 Adelie  Torge~           39.3          20.6              190        3650
##  7 Adelie  Torge~           38.9          17.8              181        3625
##  8 Adelie  Torge~           39.2          19.6              195        4675
##  9 Adelie  Torge~           34.1          18.1              193        3475
## 10 Adelie  Torge~           42            20.2              190        4250
## # ... with 334 more rows, and 2 more variables: sex <fct>, year <int>

If you try building a classification model for species, you will likely find an almost perfect fit, because these kinds of observations are actually what distinguish different species. Sex, on the other hand, is a little messier.

penguins %>% 
               filter(!is.na(sex)) %>% 
               ggplot(aes(x = flipper_length_mm, y = bill_length_mm, color = sex, size = body_mass_g)) +
               geom_point(alpha = 0.5) +
               facet_wrap(~species)

It looks like female penguins are smaller with different bills, but let’s get ready for modeling to find out more! We will not use the island or year information in out model

penguins_df <- penguins %>%
               filter(!is.na(sex)) %>%
               select(-year, -island)

Build a model

We can start by loading the tidymodels metapackage, and splitting out data into training and testing sets

library(tidymodels)
## -- Attaching packages ------------------- tidymodels 0.1.1 --
## v broom     0.7.0      v recipes   0.1.13
## v dials     0.0.8      v rsample   0.0.7 
## v infer     0.5.3      v tune      0.1.1 
## v modeldata 0.0.2      v workflows 0.1.2 
## v parsnip   0.1.2      v yardstick 0.0.7
## -- Conflicts ---------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
set.seed(123)

penguins_split <- initial_split(penguins_df, strata = sex)

penguin_train <- training(penguins_split)
penguin_test <- testing(penguins_split)

Next, let’s create bootstrap resamples of the training data, to evaluate our models

set.seed(123)

penguin_boot <- bootstraps(penguin_train)
penguin_boot
## # Bootstrap sampling 
## # A tibble: 25 x 2
##    splits           id         
##    <list>           <chr>      
##  1 <split [250/93]> Bootstrap01
##  2 <split [250/92]> Bootstrap02
##  3 <split [250/90]> Bootstrap03
##  4 <split [250/92]> Bootstrap04
##  5 <split [250/86]> Bootstrap05
##  6 <split [250/88]> Bootstrap06
##  7 <split [250/96]> Bootstrap07
##  8 <split [250/89]> Bootstrap08
##  9 <split [250/96]> Bootstrap09
## 10 <split [250/90]> Bootstrap10
## # ... with 15 more rows

Let’s compare two different models, a logistic regression model and a random forest model. We start by creating the model specifications

Logistic Regression

glm_spec <- logistic_reg() %>%
               set_engine('glm')

glm_spec
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm

Random Forest

rf_spec <- rand_forest() %>%
               set_mode('classification') %>%
               set_engine('ranger')

Next let’s start putting together a tidymodels workflow() , a helper object to help manage modeling pipelines with peices that fit together like Lego blocks. Notice that there is no model yet: Mode: None

penguin_wf <- workflow() %>%
               add_formula(sex ~ .)

penguin_wf
## == Workflow =================================================
## Preprocessor: Formula
## Model: None
## 
## -- Preprocessor ---------------------------------------------
## sex ~ .

Now we can add a model and the fit to each of the resamples. First, we can fit the logistic regression model

glm_rs <- penguin_wf %>% 
               add_model(glm_spec) %>% 
               fit_resamples(
                              resamples = penguin_boot,
                              control = control_resamples(save_pred = TRUE)
               )

glm_rs
## # Resampling results
## # Bootstrap sampling 
## # A tibble: 25 x 5
##    splits           id          .metrics        .notes          .predictions    
##    <list>           <chr>       <list>          <list>          <list>          
##  1 <split [250/93]> Bootstrap01 <tibble [2 x 3~ <tibble [0 x 1~ <tibble [93 x 5~
##  2 <split [250/92]> Bootstrap02 <tibble [2 x 3~ <tibble [0 x 1~ <tibble [92 x 5~
##  3 <split [250/90]> Bootstrap03 <tibble [2 x 3~ <tibble [0 x 1~ <tibble [90 x 5~
##  4 <split [250/92]> Bootstrap04 <tibble [2 x 3~ <tibble [0 x 1~ <tibble [92 x 5~
##  5 <split [250/86]> Bootstrap05 <tibble [2 x 3~ <tibble [0 x 1~ <tibble [86 x 5~
##  6 <split [250/88]> Bootstrap06 <tibble [2 x 3~ <tibble [0 x 1~ <tibble [88 x 5~
##  7 <split [250/96]> Bootstrap07 <tibble [2 x 3~ <tibble [0 x 1~ <tibble [96 x 5~
##  8 <split [250/89]> Bootstrap08 <tibble [2 x 3~ <tibble [0 x 1~ <tibble [89 x 5~
##  9 <split [250/96]> Bootstrap09 <tibble [2 x 3~ <tibble [0 x 1~ <tibble [96 x 5~
## 10 <split [250/90]> Bootstrap10 <tibble [2 x 3~ <tibble [0 x 1~ <tibble [90 x 5~
## # ... with 15 more rows

Second, we can fit the random forest model

rf_rs <- penguin_wf %>%
               add_model(rf_spec) %>% 
               fit_resamples(
                              resamples = penguin_boot, 
                              control = control_resamples(save_pred = TRUE)
               )

rf_rs
## # Resampling results
## # Bootstrap sampling 
## # A tibble: 25 x 5
##    splits           id          .metrics        .notes          .predictions    
##    <list>           <chr>       <list>          <list>          <list>          
##  1 <split [250/93]> Bootstrap01 <tibble [2 x 3~ <tibble [0 x 1~ <tibble [93 x 5~
##  2 <split [250/92]> Bootstrap02 <tibble [2 x 3~ <tibble [0 x 1~ <tibble [92 x 5~
##  3 <split [250/90]> Bootstrap03 <tibble [2 x 3~ <tibble [0 x 1~ <tibble [90 x 5~
##  4 <split [250/92]> Bootstrap04 <tibble [2 x 3~ <tibble [0 x 1~ <tibble [92 x 5~
##  5 <split [250/86]> Bootstrap05 <tibble [2 x 3~ <tibble [0 x 1~ <tibble [86 x 5~
##  6 <split [250/88]> Bootstrap06 <tibble [2 x 3~ <tibble [0 x 1~ <tibble [88 x 5~
##  7 <split [250/96]> Bootstrap07 <tibble [2 x 3~ <tibble [0 x 1~ <tibble [96 x 5~
##  8 <split [250/89]> Bootstrap08 <tibble [2 x 3~ <tibble [0 x 1~ <tibble [89 x 5~
##  9 <split [250/96]> Bootstrap09 <tibble [2 x 3~ <tibble [0 x 1~ <tibble [96 x 5~
## 10 <split [250/90]> Bootstrap10 <tibble [2 x 3~ <tibble [0 x 1~ <tibble [90 x 5~
## # ... with 15 more rows

We have fit each of out candidate models to our resampled training set

Evaluate model

Now let’s check out how we did

rf_rs %>% collect_metrics()
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy binary     0.892    25 0.00668
## 2 roc_auc  binary     0.958    25 0.00352

Pretty nice! The function collect_metrics() extracts and formats the .metrics() column from resampling results like the ones we have here.

glm_rs %>% collect_metrics()
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy binary     0.897    25 0.00631
## 2 roc_auc  binary     0.964    25 0.00368

So… also great! If I am in a situation where a more complex model like a random forest performs the same as a simpler odel like logisitcs regression, then I will choose a simpler model. Let’s dig deeper into how it is doing. For example, how is it predicting the two classes?

glm_rs %>% conf_mat_resampled()
## # A tibble: 4 x 3
##   Prediction Truth   Freq
##   <fct>      <fct>  <dbl>
## 1 female     female 40.6 
## 2 female     male    4.48
## 3 male       female  4.92
## 4 male       male   41.4

About the same, which is good. We can also make an ROC curve

glm_rs %>% collect_predictions() %>%
               group_by(id) %>% 
               roc_curve(sex, .pred_female) %>%
               ggplot(aes(x = 1 - specificity, y = sensitivity, color = id)) +
               geom_abline(lty = 2, color = 'gray80', size = 1.5) +
               geom_path(show.legend = FALSE, alpha = 0.6, size = 1.2) +
               coord_equal()

This ROC curve is more jagged than others you may have seen because the dataset is small

It is finally time for us to return to the testing set. Notice that we have not used the testing set yet during this whole analysis; the testing set is precious and can b only be used to estimate performance on new data. Let’s fit one more time to the training data and evaluate on the testing data using the function last_fit

penguin_final <- penguin_wf %>% 
add_model(glm_spec) %>% 
last_fit(penguins_split)

penguin_final
## # Resampling results
## # Monte Carlo cross-validation (0.75/0.25) with 1 resamples  
## # A tibble: 1 x 6
##   splits       id           .metrics      .notes       .predictions    .workflow
##   <list>       <chr>        <list>        <list>       <list>          <list>   
## 1 <split [250~ train/test ~ <tibble [2 x~ <tibble [0 ~ <tibble [83 x ~ <workflo~
collect_predictions(penguin_final) %>%
               conf_mat(sex, .pred_class)
##           Truth
## Prediction female male
##     female     39    3
##     male        2   39

The coefficients (which we can get out using tidy()) have been estimated using the training data. If we use exponentiate = TRUE, we have adds ratios

penguin_final$.workflow[[1]] %>% tidy(exponentiate = TRUE)
## # A tibble: 7 x 5
##   term              estimate std.error statistic       p.value
##   <chr>                <dbl>     <dbl>     <dbl>         <dbl>
## 1 (Intercept)       3.12e-35  13.5         -5.90 0.00000000369
## 2 speciesChinstrap  1.34e- 3   1.70        -3.89 0.000101     
## 3 speciesGentoo     1.08e- 4   2.89        -3.16 0.00159      
## 4 bill_length_mm    1.78e+ 0   0.137        4.20 0.0000268    
## 5 bill_depth_mm     3.89e+ 0   0.373        3.64 0.000273     
## 6 flipper_length_mm 1.07e+ 0   0.0538       1.31 0.189        
## 7 body_mass_g       1.01e+ 0   0.00108      4.70 0.00000260
penguins %>%
               filter(!is.na(sex)) %>%
               ggplot(aes(bill_depth_mm, bill_length_mm, color = sex, size = body_mass_g)) +
               geom_point(alpha = 0.5) + 
               facet_wrap(~species)

Yes, the male and female penguins aree much more separated now.