pacman::p_load(tidyr, dplyr, purrr, palmerpenguins, broom)
Nested data and models
Nested data
tidyr::nest()
Use nest() to create list-columns: the rows of each group are packed into a tibble stored in a single cell.
df2 <- tribble(
~g, ~x, ~y,
1, 1, 2,
2, 4, 6,
2, 5, 7,
3, 10, NA
)
# `data` names the new list-column; c(x, y) is a <tidy-select> of the columns to nest.
df2 %>% nest(data = c(x, y))
# A tibble: 3 × 2
g data
<dbl> <list>
1 1 <tibble [1 × 2]>
2 2 <tibble [2 × 2]>
3 3 <tibble [1 × 2]>
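To see what a list-column actually holds, it can help to pull out one element. The following is a minimal sketch that rebuilds df2 from above; the names nested and second_group are only illustrative.

```r
library(dplyr)  # %>% and tribble()
library(tidyr)  # nest()

df2 <- tribble(
  ~g, ~x, ~y,
   1,  1,  2,
   2,  4,  6,
   2,  5,  7,
   3, 10, NA
)

nested <- df2 %>% nest(data = c(x, y))

# Each element of the list-column is an ordinary tibble.
# [[2]] extracts the tibble holding the two rows where g == 2.
second_group <- nested$data[[2]]
second_group
```

Because every cell of data is itself a tibble, anything that works on a data frame (nrow(), summary(), lm(), ...) can be applied to it.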
Instead of naming the columns to nest with name = columns pairs as above, we can also create nested tibbles by calling nest() on a data frame grouped with group_by():
penguins %>%
  group_by(island, species) %>%
  nest()
# A tibble: 5 × 3
# Groups: island, species [5]
species island data
<fct> <fct> <list>
1 Adelie Torgersen <tibble [52 × 6]>
2 Adelie Biscoe <tibble [44 × 6]>
3 Adelie Dream <tibble [56 × 6]>
4 Gentoo Biscoe <tibble [124 × 6]>
5 Chinstrap Dream <tibble [68 × 6]>
Each row in the output corresponds to one group in the input, and the grouping columns stay outside the nested tibbles.
tidyr::unnest()
We “unfold” the nested tibbles.
The opposite of nest() is unnest(). You give it the name of a list-column containing data frames, and it row-binds the data frames together, repeating the outer columns the right number of times to line up.
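As a small illustration of the round trip, the sketch below nests df2 and then unnests it again. The names nested and flat are only illustrative.

```r
library(dplyr)  # %>% and tribble()
library(tidyr)  # nest(), unnest()

df2 <- tribble(
  ~g, ~x, ~y,
   1,  1,  2,
   2,  4,  6,
   2,  5,  7,
   3, 10, NA
)

# Pack x and y into a list-column, one tibble per value of g.
nested <- df2 %>% nest(data = c(x, y))

# Row-bind the inner tibbles; g is repeated once per inner row.
flat <- nested %>% unnest(data)
flat
```

Here flat should contain the same four rows and three columns (g, x, y) as the original df2.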
Nested data and models
Nested data is a great fit for problems where you have one of something for each group. A common place this arises is when you’re fitting multiple models.
penguins by island
For example, for each island, we fit a linear regression to investigate how body mass influences bill length.
(penguins_by_island <- penguins %>%
  nest(data_by_island = -island))
# A tibble: 3 × 2
island data_by_island
<fct> <list>
1 Torgersen <tibble [52 × 7]>
2 Biscoe <tibble [168 × 7]>
3 Dream <tibble [124 × 7]>
Then we use map() inside mutate() to fit a model to each group's data.
Recall that mutate() creates columns from name = value pairs; here the value is the list returned by map():
# Notice that map() takes a list of tibbles and returns a list of S3 lm fits.
(
  model_by_island <- penguins_by_island %>%
    mutate(
      lm_fits = map(
        .x = data_by_island,
        .f = ~ lm(data = .x, formula = bill_length_mm ~ body_mass_g)
      )
    )
)
# A tibble: 3 × 3
island data_by_island lm_fits
<fct> <list> <list>
1 Torgersen <tibble [52 × 7]> <lm>
2 Biscoe <tibble [168 × 7]> <lm>
3 Dream <tibble [124 × 7]> <lm>
Neat. But wait, how do we unnest lm_fits to compare the fits?
This is what I get when trying to unnest it directly :(
model_by_island %>% unnest(lm_fits)
# Error in `list_sizes()`:
# ! `x[[1]]` must be a vector, not a <lm> object.
Well that’s when broom::tidy comes into play.
model_by_island %>%
  mutate(tidy_fits = map(lm_fits, tidy))
# A tibble: 3 × 4
island data_by_island lm_fits tidy_fits
<fct> <list> <list> <list>
1 Torgersen <tibble [52 × 7]> <lm> <tibble [2 × 5]>
2 Biscoe <tibble [168 × 7]> <lm> <tibble [2 × 5]>
3 Dream <tibble [124 × 7]> <lm> <tibble [2 × 5]>
We obtain a tibble of tidy information for each fit.
Then we just need to unnest the tidy info!
model_by_island %>%
  mutate(tidy_fits = map(lm_fits, tidy)) %>%
  unnest(tidy_fits)
# A tibble: 6 × 8
island data_by_island lm_fits term estimate std.error statistic p.value
<fct> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl>
1 Torgersen <tibble> <lm> (Inter… 27.7 3.24 8.54 2.88e-11
2 Torgersen <tibble> <lm> body_m… 0.00304 0.000869 3.50 1.01e- 3
3 Biscoe <tibble> <lm> (Inter… 20.3 1.13 18.0 8.36e-41
4 Biscoe <tibble> <lm> body_m… 0.00529 0.000236 22.4 5.50e-52
5 Dream <tibble> <lm> (Inter… 27.7 4.59 6.03 1.83e- 8
6 Dream <tibble> <lm> body_m… 0.00444 0.00123 3.61 4.45e- 4
This workflow works particularly well in conjunction with broom, which makes it easy to turn models into tidy data frames which can then be unnest()ed to get back to flat data frames.
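The same pattern works with broom's other summaries. As a sketch, assuming one-row-per-model statistics are what we want for comparing the fits, glance() can stand in for tidy(). The names fit_summaries and glance_fits are only illustrative, and the columns kept by select() are one possible choice.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
library(palmerpenguins)

fit_summaries <- penguins %>%
  nest(data_by_island = -island) %>%
  mutate(
    # One lm fit per island; lm() drops rows with missing values by default.
    lm_fits = map(data_by_island, ~ lm(bill_length_mm ~ body_mass_g, data = .x)),
    # glance() returns a one-row tibble of model-level statistics.
    glance_fits = map(lm_fits, glance)
  ) %>%
  unnest(glance_fits) %>%
  select(island, r.squared, sigma, nobs)

fit_summaries
```

Since glance() yields exactly one row per model, the unnested result keeps one row per island, which makes the fits easy to compare side by side.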
For more, see the "broom and dplyr" vignette in the broom package.