BAGGING WITH TIDYMODELS AND TIDYTUESDAY ASTRONAUT MISSIONS

Lately I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to evaluate complex models. Today’s screencast focuses on bagging using this week’s TidyTueday dataset on astronaut missions

Here is the code I used in the video, for those who prefer reading instead of or in addition to video

Explore the data

Let’s start by reading in the data and check out what the top spacecraft used in orbit have been.

suppressMessages(library(tidyverse))
suppressMessages(library(tidymodels))
theme_set(theme_light())
astronauts <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-14/astronauts.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   name = col_character(),
##   original_name = col_character(),
##   sex = col_character(),
##   nationality = col_character(),
##   military_civilian = col_character(),
##   selection = col_character(),
##   occupation = col_character(),
##   mission_title = col_character(),
##   ascend_shuttle = col_character(),
##   in_orbit = col_character(),
##   descend_shuttle = col_character()
## )
## See spec(...) for full column specifications.
astronauts %>% count(in_orbit, sort = TRUE)
## # A tibble: 289 x 2
##    in_orbit      n
##    <chr>     <int>
##  1 ISS         174
##  2 Mir          71
##  3 Salyut 6     24
##  4 Salyut 7     24
##  5 STS-42        8
##  6 explosion     7
##  7 STS-103       7
##  8 STS-107       7
##  9 STS-109       7
## 10 STS-110       7
## # ... with 279 more rows

How has the duration of missions changed over time?

astronauts %>% mutate(
               year_of_mission = 10*(year_of_mission %/% 10),
               year_of_mission = factor(year_of_mission)
) %>% 
               ggplot(aes(x = year_of_mission, y = hours_mission, fill = year_of_mission, color = year_of_mission)) + geom_boxplot(alpha = 0.2, size = 1.5, show.legend = FALSE) + scale_y_log10() + labs(x = NULL, y  = 'Duration of mission in hours')
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 6 rows containing non-finite values (stat_boxplot).

This duration is what we want to built a model to predict, using the other information in this per-astronaut-per-mission dataset. Let’s get ready for modeling next, by bucketing some of spacecraft together (such as all the space shuttle missions) and taking the logarithm of the mission length

astronauts_df <- astronauts %>% 
               select(
                              name, mission_title, hours_mission, military_civilian, occupation, year_of_mission, in_orbit
               ) %>% 
               mutate(
                   in_orbit = case_when(
                                  str_detect(in_orbit, '^Salyut') ~ 'Salyut',
                                  str_detect(in_orbit, '^STS') ~ 'STS',
                                  TRUE ~ in_orbit
                   ), occupation = str_to_lower(occupation)
               ) %>% 
               filter(hours_mission > 0) %>% 
               mutate(hours_mission = log(hours_mission)) %>% na.omit()

It may make more sense to perform transformation like taking the logarithm of the outcome during data cleaning, before feature engineering and using and tidymodels packages like recipes. This kind of transformation is deterministic and can cause problems for tuning and resampling.

Build a model

We can start by loading tidymodels metapackage, and splitting our data into training and testing sets

library(tidymodels)

set.seed(123)
astro_split <- initial_split(astronauts_df, strata = hours_mission)

astro_train <- training(astro_split)
astro_test <- testing(astro_split)

Next, let’s preprocess our data to get it ready for modeling

astro_recipe <- recipe(hours_mission ~., data= astro_train) %>%
               update_role(name, mission_title, new_role = 'id') %>% 
               step_other(occupation, in_orbit, threshold = 0.005, other = 'Other') %>% 
               step_dummy(all_nominal(), -has_role('id'))

Let’s walk through the steps in this recipe

We’re going to use this recipe in a workflow so we don’t need to stress about whether to prep() or not

astro_wf <- workflow() %>% 
               add_recipe(astro_recipe)

astro_wf
## == Workflow ==================
## Preprocessor: Recipe
## Model: None
## 
## -- Preprocessor --------------
## 2 Recipe Steps
## 
## * step_other()
## * step_dummy()

For this analysis, we are going to build a bagging, bootstrap aggregating model. This is an ensembling and model averaging method that

In tidymodels, you can create bagging ensemble models with baguette, a parsnip - adjacent package. The baguette functions create new bootstrap training sets by sampling with replacement and then fit a model to each new training set. These models are combined by averaging the predictions for the regression case, like what we have here (by voting, for classification)

Let’s make 2 bagged models one with decision trees and oen with MARS models

library(baguette)
## Warning: package 'baguette' was built under R version 4.0.2
tree_spec <- bag_tree() %>% 
               set_engine('rpart', times = 25) %>%
               set_mode('regression')

tree_spec
## Bagged Decision Tree Model Specification (regression)
## 
## Main Arguments:
##   cost_complexity = 0
##   min_n = 2
## 
## Engine-Specific Arguments:
##   times = 25
## 
## Computational engine: rpart
mars_spec <- bag_mars() %>%
               set_engine('earth', times = 25) %>% 
               set_mode('regression')

mars_spec
## Bagged MARS Model Specification (regression)
## 
## Engine-Specific Arguments:
##   times = 25
## 
## Computational engine: earth

Let’s fit these models to the training data

tree_rs <- astro_wf %>% add_model(tree_spec)%>%
               fit(astro_train)

tree_rs
## == Workflow [trained] ========
## Preprocessor: Recipe
## Model: bag_tree()
## 
## -- Preprocessor --------------
## 2 Recipe Steps
## 
## * step_other()
## * step_dummy()
## 
## -- Model ---------------------
## Bagged CART (regression with 25 members)
## 
## Variable importance scores include:
## 
## # A tibble: 12 x 4
##    term                             value std.error  used
##    <chr>                            <dbl>     <dbl> <int>
##  1 year_of_mission                  889.      18.7     25
##  2 in_orbit_Other                   688.      55.7     25
##  3 in_orbit_STS                     387.      19.0     25
##  4 occupation_pilot                 189.      20.4     25
##  5 occupation_flight.engineer       187.      14.6     25
##  6 in_orbit_Mir                     124.      20.7     25
##  7 in_orbit_Salyut                  100.       9.61    25
##  8 occupation_msp                    96.3      9.88    25
##  9 military_civilian_military        39.8      4.77    25
## 10 occupation_other..space.tourist.  36.1      3.62    25
## 11 occupation_psp                    33.9      6.14    25
## 12 occupation_Other                  18.5      1.68    23
mars_rs <- astro_wf %>% add_model(mars_spec) %>%
               fit(astro_train)

mars_rs
## == Workflow [trained] ========
## Preprocessor: Recipe
## Model: bag_mars()
## 
## -- Preprocessor --------------
## 2 Recipe Steps
## 
## * step_other()
## * step_dummy()
## 
## -- Model ---------------------
## Bagged MARS (regression with 25 members)
## 
## Variable importance scores include:
## 
## # A tibble: 10 x 4
##    term                         value std.error  used
##    <chr>                        <dbl>     <dbl> <int>
##  1 in_orbit_STS               100         0        25
##  2 in_orbit_Other              91.8       1.78     25
##  3 year_of_mission             62.8       4.44     25
##  4 in_orbit_Salyut             31.7       2.42     25
##  5 occupation_Other             2.18      1.23      5
##  6 in_orbit_Mir                 1.29      0.776     5
##  7 military_civilian_military   0.812     1.15      3
##  8 occupation_pilot             0.473     0.839     2
##  9 occupation_psp               0.394     0.967     2
## 10 occupation_flight.engineer   0.222     0         1

The models return aggregated variable importance scores, and we can see that the spacecraft and year are importance in both models

Evaluate model

Let’s evaluate how well these two models did by evaluating performance on the test data

test_rs <- astro_test %>% bind_cols(predict(tree_rs, astro_test))%>%
               rename(.pred_tree = .pred) %>% 
bind_cols(predict(mars_rs, astro_test))%>%
               rename(.pred_mars = .pred)

test_rs
## # A tibble: 316 x 9
##    name  mission_title hours_mission military_civili~ occupation year_of_mission
##    <chr> <chr>                 <dbl> <chr>            <chr>                <dbl>
##  1 Carp~ Mercury-Atla~          1.61 military         pilot                 1962
##  2 Schi~ Mercury-Atla~          2.22 military         pilot                 1962
##  3 Tere~ Vostok 6               4.26 military         pilot                 1963
##  4 Koma~ Voskhod 1              3.19 military         commander             1964
##  5 Feok~ Voskhod 1              3.19 civilian         msp                   1964
##  6 Youn~ Gemini 10              4.26 military         pilot                 1966
##  7 Youn~ Apollo 16              5.58 military         commander             1972
##  8 Youn~ STS-9                  5.48 military         commander             1983
##  9 McDi~ Gemini 4               4.57 military         commander             1965
## 10 Whit~ Gemini 4               4.58 military         pilot                 1965
## # ... with 306 more rows, and 3 more variables: in_orbit <chr>,
## #   .pred_tree <dbl>, .pred_mars <dbl>

We can use the metric() function from yardstick for both sets of predictions

test_rs %>%
               metrics(hours_mission, .pred_tree)
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       0.658
## 2 rsq     standard       0.787
## 3 mae     standard       0.361
test_rs %>% metrics(hours_mission, .pred_mars)
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       0.638
## 2 rsq     standard       0.796
## 3 mae     standard       0.351

Both models performed pretty similarly

Let’s make some new astronauts to understand the kinds of predictions out bagged tree model is making

new_astronauts <- crossing(
               in_orbit = fct_inorder(c("ISS", "STS", "Mir", "Other")),
               military_civilian = 'civilian',
               occupation = 'Other',
               year_of_mission = seq(1960, 2020, by =10),
               name = 'id', mission_title = 'id')%>% 
               filter(
                              !(in_orbit == 'ISS' & year_of_mission < 2000),
                              !(in_orbit == 'Mir' & year_of_mission < 1900),
                              !(in_orbit == 'STS' & year_of_mission < 2010),
                              !(in_orbit == 'STS' & year_of_mission < 1980)
               )

new_astronauts
## # A tibble: 19 x 6
##    in_orbit military_civilian occupation year_of_mission name  mission_title
##    <fct>    <chr>             <chr>                <dbl> <chr> <chr>        
##  1 ISS      civilian          Other                 2000 id    id           
##  2 ISS      civilian          Other                 2010 id    id           
##  3 ISS      civilian          Other                 2020 id    id           
##  4 STS      civilian          Other                 2010 id    id           
##  5 STS      civilian          Other                 2020 id    id           
##  6 Mir      civilian          Other                 1960 id    id           
##  7 Mir      civilian          Other                 1970 id    id           
##  8 Mir      civilian          Other                 1980 id    id           
##  9 Mir      civilian          Other                 1990 id    id           
## 10 Mir      civilian          Other                 2000 id    id           
## 11 Mir      civilian          Other                 2010 id    id           
## 12 Mir      civilian          Other                 2020 id    id           
## 13 Other    civilian          Other                 1960 id    id           
## 14 Other    civilian          Other                 1970 id    id           
## 15 Other    civilian          Other                 1980 id    id           
## 16 Other    civilian          Other                 1990 id    id           
## 17 Other    civilian          Other                 2000 id    id           
## 18 Other    civilian          Other                 2010 id    id           
## 19 Other    civilian          Other                 2020 id    id

Let’s start with the decision tree model

new_astronauts %>% bind_cols(predict(tree_rs, new_astronauts)) %>%
               ggplot(aes(x = year_of_mission, y = .pred, color = in_orbit)) + geom_line(size = 1.5, alpha = 0.7) + geom_point(size = 2) + labs(x = NULL, y = 'Duration of mission in hours (predicted, on log scale)', color = NULL, title = 'How did the duration of astronauts missions chaneg over time?', subtitle = 'Predicted using bagged decision tree model')

What about the MARS model?

new_astronauts %>% bind_cols(predict(mars_rs, new_astronauts)) %>% 
               ggplot(aes(x= year_of_mission, y = .pred, color = in_orbit)) + geom_line(size = 1.5, alpha = 0.7) + geom_point(size = 2) + labs(
                              x = NULL, y = 'Duration of mission in hours (predicted, on log scale)',
                              color = NULL, title = 'How did the duration of astronauts mission change over time?', subtitle = ' Predicted using bagged MARS model')

You can realy get a sense of how these 2 kinds of models work from the differences in these plots (tree vs splines with knots), but from both, we can see that missions to space stations are longer , and missions in that “Other” category change characteristics over time pretty dramatically