Nested data and models

Nested data

pacman::p_load(tidyr, dplyr, purrr, palmerpenguins, broom)

tidyr::nest()

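nest() creates list-columns: it collapses the selected columns into one tibble per group and stores those tibbles in a single column.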

df2 <- tribble(
  ~g, ~x, ~y,
   1,  1,  2,
   2,  4,  6,
   2,  5,  7,
   3, 10,  NA
)

# `data = c(x, y)` names the new list-column (data) and uses <tidy-select> to pick the columns nested into it.
df2 %>% nest(data = c(x, y))
# A tibble: 3 × 2
      g data            
  <dbl> <list>          
1     1 <tibble [1 × 2]>
2     2 <tibble [2 × 2]>
3     3 <tibble [1 × 2]>
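To see what a list-column actually holds, one can pull out a single element with `[[`. The following is a minimal sketch that rebuilds the same example; the names `nested` and `group2` are introduced here only for illustration.

```r
library(dplyr)   # provides the %>% pipe
library(tidyr)   # nest(); also re-exports tribble()

df2 <- tribble(
  ~g, ~x, ~y,
   1,  1,  2,
   2,  4,  6,
   2,  5,  7,
   3, 10, NA
)

nested <- df2 %>% nest(data = c(x, y))

# Each element of the list-column is an ordinary tibble;
# this one holds the two rows that had g == 2.
group2 <- nested$data[[2]]
group2
```

The outer tibble keeps one row per value of g, while the rows of x and y live inside the matching element of data.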

Instead of naming the columns to nest with name-variable pairs as above, we can also create nested tibbles with group_by(): calling nest() on a grouped data frame nests every non-grouping column.

penguins %>% 
  group_by(island, species) %>% 
  nest()
# A tibble: 5 × 3
# Groups:   island, species [5]
  species   island    data              
  <fct>     <fct>     <list>            
1 Adelie    Torgersen <tibble [52 × 6]> 
2 Adelie    Biscoe    <tibble [44 × 6]> 
3 Adelie    Dream     <tibble [56 × 6]> 
4 Gentoo    Biscoe    <tibble [124 × 6]>
5 Chinstrap Dream     <tibble [68 × 6]> 

Each row in the output corresponds to one group in the input.
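The one-row-per-group claim can be checked directly. This is a rough sketch, assuming the palmerpenguins data as loaded above; `nested_penguins`, `n_groups_in_input`, and `rows_per_group` are illustrative names, not part of any package.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(palmerpenguins)

nested_penguins <- penguins %>%
  group_by(island, species) %>%
  nest()

# Number of distinct island/species combinations in the raw data.
n_groups_in_input <- penguins %>%
  distinct(island, species) %>%
  nrow()

# Row count of each nested tibble; together they account for every penguin.
rows_per_group <- map_int(nested_penguins$data, nrow)

nrow(nested_penguins) == n_groups_in_input
sum(rows_per_group) == nrow(penguins)
```

Both comparisons should come out TRUE: nesting rearranges the rows into groups without dropping or duplicating any of them.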

tidyr::unnest()

We “unfold” the nested tibbles.

The opposite of nest() is unnest(). You give it the name of a list-column containing data frames, and it row-binds the data frames together, repeating the outer columns the right number of times to line up.
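A small round trip makes the description concrete. The sketch below nests the toy tibble from earlier and then unnests it again; `nested` and `flat` are names chosen here for illustration.

```r
library(dplyr)   # provides the %>% pipe
library(tidyr)   # nest(), unnest(); also re-exports tribble()

df2 <- tribble(
  ~g, ~x, ~y,
   1,  1,  2,
   2,  4,  6,
   2,  5,  7,
   3, 10, NA
)

nested <- df2 %>% nest(data = c(x, y))

# unnest() row-binds the tibbles stored in `data`. The outer column g is
# repeated once per inner row, so the g == 2 group expands back to two rows.
flat <- nested %>% unnest(data)
flat
```

The result has the same four rows and three columns as the original df2.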

Nested data and models

Nested data is a great fit for problems where you have one of something for each group. A common place this arises is when you’re fitting multiple models.

penguins by island

For example, for each island, we fit a linear regression to investigate how bill length relates to body mass.

(penguins_by_island <- penguins %>% 
  nest(data_by_island = -island))
# A tibble: 3 × 2
  island    data_by_island    
  <fct>     <list>            
1 Torgersen <tibble [52 × 7]> 
2 Biscoe    <tibble [168 × 7]>
3 Dream     <tibble [124 × 7]>

Then we use map() to fit a model to each group's data.

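Recall that mutate() takes name-value pairs of the form new_column = expression. Here the expression is a map() call, so the new column is a list-column: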

# map() runs over the list of tibbles and returns a list of S3 lm fits.
(model_by_island <- penguins_by_island %>%
  mutate(
    lm_fits = map(
      .x = data_by_island,
      .f = ~ lm(formula = bill_length_mm ~ body_mass_g, data = .x)
    )
  ))
# A tibble: 3 × 3
  island    data_by_island     lm_fits
  <fct>     <list>             <list> 
1 Torgersen <tibble [52 × 7]>  <lm>   
2 Biscoe    <tibble [168 × 7]> <lm>   
3 Dream     <tibble [124 × 7]> <lm>   

Neat. But wait, how do we unnest lm_fits to compare the fits?

I get this when trying to unnest :(

model_by_island %>% unnest(lm_fits)
# Error in `list_sizes()`:
# ! `x[[1]]` must be a vector, not a <lm> object.

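Well, that's where broom::tidy() comes into play. unnest() can only expand list-columns whose elements are vectors or data frames, and an lm object is neither. tidy() summarises each fit as a tibble with one row per model term.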

model_by_island %>% 
  mutate(tidy_fits = map(lm_fits, tidy))
# A tibble: 3 × 4
  island    data_by_island     lm_fits tidy_fits       
  <fct>     <list>             <list>  <list>          
1 Torgersen <tibble [52 × 7]>  <lm>    <tibble [2 × 5]>
2 Biscoe    <tibble [168 × 7]> <lm>    <tibble [2 × 5]>
3 Dream     <tibble [124 × 7]> <lm>    <tibble [2 × 5]>

We obtain a tibble of tidy coefficient information for each fit.
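It can help to look at what tidy() returns for a single model before mapping it over all of them. A minimal sketch, fitting the same regression for one island only; `torgersen`, `fit`, and `coef_table` are illustrative names.

```r
library(dplyr)
library(palmerpenguins)
library(broom)

# Fit the same regression for one island only.
torgersen <- penguins %>% filter(island == "Torgersen")
fit <- lm(bill_length_mm ~ body_mass_g, data = torgersen)

# tidy() returns a tibble with one row per term (intercept and slope)
# and the columns term, estimate, std.error, statistic, p.value.
coef_table <- tidy(fit)
coef_table
```

This is the 2 × 5 tibble that appears in each element of tidy_fits.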

Then we just need to unnest the tidy info!

model_by_island %>% 
  mutate(tidy_fits = map(lm_fits, tidy)) %>% 
  unnest(tidy_fits)
# A tibble: 6 × 8
  island    data_by_island lm_fits term    estimate std.error statistic  p.value
  <fct>     <list>         <list>  <chr>      <dbl>     <dbl>     <dbl>    <dbl>
1 Torgersen <tibble>       <lm>    (Inter… 27.7      3.24          8.54 2.88e-11
2 Torgersen <tibble>       <lm>    body_m…  0.00304  0.000869      3.50 1.01e- 3
3 Biscoe    <tibble>       <lm>    (Inter… 20.3      1.13         18.0  8.36e-41
4 Biscoe    <tibble>       <lm>    body_m…  0.00529  0.000236     22.4  5.50e-52
5 Dream     <tibble>       <lm>    (Inter… 27.7      4.59          6.03 1.83e- 8
6 Dream     <tibble>       <lm>    body_m…  0.00444  0.00123       3.61 4.45e- 4

This workflow pairs particularly well with broom, which turns models into tidy data frames that can then be unnest()ed back into flat data frames.
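The same pattern should carry over to model-level summaries. The sketch below swaps tidy() for broom::glance(), which is not used in the text above and is shown here only as an assumption about what a reader might try next; `fit_summaries` and `glance_fits` are illustrative names.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(palmerpenguins)
library(broom)

fit_summaries <- penguins %>%
  nest(data_by_island = -island) %>%
  mutate(
    lm_fits = map(data_by_island, ~ lm(bill_length_mm ~ body_mass_g, data = .x)),
    # glance() returns a one-row tibble of model-level statistics per fit.
    glance_fits = map(lm_fits, glance)
  ) %>%
  unnest(glance_fits) %>%
  # Keep a few columns that are convenient for comparing the fits.
  select(island, r.squared, sigma, nobs)

fit_summaries
```

Because glance() gives one row per model, the unnested result has one row per island rather than one row per coefficient.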

For more, see the "broom and dplyr" vignette in the broom package.